提交 · a257b94a18f7eb60bbe9b5fd415d208ac71d49ea · openanolis / cloud-kernel

30 4月, 2016 27 次提交

net/mlx5: Set number of allowed levels in priority · a257b94a

由 Maor Gottlieb 提交于 4月 29, 2016

Refactors the flow steering namespace creation,
by changing the name num_fts to num_levels.
When new flow table is created, the driver assign new level
to this flow table therefore the meaning is equivalent.
Since downstream patches will introduce the ability to create more
than one flow table per level, the name num_fts is no
longer accurate.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a257b94a

net/mlx5: Introduce modify flow rule destination · d745098c

由 Maor Gottlieb 提交于 4月 29, 2016

This API is used for modifying the flow rule destination.
This is needed for modifying the pointed flow table by the
traffic type classifier rules to point on the aRFS tables.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d745098c

net/mlx5e: Direct TIR per RQ · 1da36696

由 Tariq Toukan 提交于 4月 29, 2016

Introduce new TIRs for direct access per RQ.
Now we have 2 available kinds of TIRs:
	- indirect TIR per traffic type, each points to one RQT (RSS RQT)
          same as before.
	- New direct TIR per RQ, each points to RQT with a size of one
          that forwards packets to that RQ only.

Driver will open max channels (num cores) direct TIRs by default,
they will be filled with the actual RQs once channels are allocated.

Needed for downstream aRFS and ethtool direct steering functionalities.
Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1da36696

net/mlx5e: Call vxlan_get_rx_port() with rtnl lock · 01a14098

由 Matthew Finlay 提交于 4月 29, 2016

Hold the rtnl lock when calling vxlan_get_rx_port().

Fixes: b7aade15 ("vxlan: break dependency with netdev drivers")
Signed-off-by: NMatthew Finlay <matt@mellanox.com>
Reported-by: NAlexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

01a14098

Merge branch 'enc28j60-small-improvements' · 4b2523c1

由 David S. Miller 提交于 4月 29, 2016

Michael Heimpold says:

====================
net: ethernet: enc28j60: small improvements

This series of two patches adds the following improvements to the driver:

1) Rework the central SPI read function so that it is compatible with
   SPI masters which only support half duplex transfers.

2) Add a device tree binding for the driver.

Changelog:

v3: * renamed and improved binding documentation as
      suggested by Rob Herring

v2: * took care of Arnd Bergmann's review comments
      - allow to specify MAC address via DT
      - unconditionally define DT id table
    * increased the driver version minor number
    * driver author's email address bounces, removed from address list

v1: * Initial submission
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b2523c1

net: ethernet: enc28j60: add device tree support · 2dd355a0

由 Michael Heimpold 提交于 4月 28, 2016

The following patch adds the required match table for device tree support
(and while at, fix the indent). It's also possible to specify the
MAC address in the DT blob.

Also add the corresponding binding documentation file.
Signed-off-by: NMichael Heimpold <mhei@heimpold.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2dd355a0

net: ethernet: enc28j60: support half-duplex SPI controllers · 2957a28a

由 Michael Heimpold 提交于 4月 28, 2016

The current spi_read_buf function fails on SPI host masters which
are only half-duplex capable. Splitting the Tx and Rx part solves
this issue.

Tested on Raspberry Pi (full duplex) and I2SE Duckbill (half duplex).
Signed-off-by: NMichael Heimpold <mhei@heimpold.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2957a28a

net: constify is_skb_forwardable's arguments · f4b05d27

由 Nikolay Aleksandrov 提交于 4月 28, 2016

is_skb_forwardable is not supposed to change anything so constify its
arguments
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4b05d27

Merge branch 'ppp-rtnetlink' · 92aff96a

由 David S. Miller 提交于 4月 29, 2016

Guillaume Nault says:

====================
ppp: add rtnetlink support

PPP devices lack the ability to be customised at creation time. In
particular they can't be created in a given netns or with a particular
name. Moving or renaming the device after creation is possible, but
creates undesirable transient effects on servers where PPP devices are
constantly created and removed, as users connect and disconnect.
Implementing rtnetlink support solves this problem.

The rtnetlink handlers implemented in this series are minimal, and can
only replace the PPPIOCNEWUNIT ioctl. The rest of PPP ioctls remains
necessary for any other operation on channels and units.
It is perfectly possible to mix PPP devices created by rtnl
and by ioctl(PPPIOCNEWUNIT). Devices will behave in the same way.

mutex_trylock() is used to resolve the locking issue wrt. locking
dependency between rtnl_lock() and ppp_mutex (see ppp_nl_newlink() in
patch #2).

A user visible difference brought by this series is that old PPP
interfaces (those created with ioctl(PPPIOCNEWUNIT)), can now be
removed by "ip link del", just like new rtnl based PPP devices.

Changes since v3:
  - Rebase on net-next.
  - Not an RFC anymore.

Changes since v2:
  - Define ->rtnl_link_ops for ioctl based PPP devices, so they can
    handle rtnl messages just like rtnl based ones (suggested by
    Stephen Hemminger).
  - Move back to original lock ordering between ppp_mutex and rtnl_lock
    to simplify patch series. Handle lock inversion issue using
    mutex_trylock() (suggested by Stephen Hemminger).
  - Do file descriptor lookup directly in ppp_nl_newlink(), to simplify
    ppp_dev_configure().

Changes since v1:
  - Rebase on net-next.
  - Invert locking order wrt. ppp_mutex and rtnl_lock and protect
    file->private_data with ppp_mutex.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

92aff96a

ppp: add rtnetlink device creation support · 96d934c7

由 Guillaume Nault 提交于 4月 28, 2016

Define PPP device handler for use with rtnetlink.
The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and
contains the file descriptor of the associated /dev/ppp instance (the
file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in
the ioctl-based API). The PPP device is removed when this file
descriptor is released (same behaviour as with ioctl based PPP
devices).

PPP devices created with the rtnetlink API behave like the ones created
with ioctl(PPPIOCNEWUNIT). In particular existing ioctls work the same
way, no matter how the PPP device was created.
The rtnl callbacks are also assigned to ioctl based PPP devices. This
way, rtnl messages have the same effect on any PPP devices.
The immediate effect is that all PPP devices, even ioctl-based
ones, can now be removed with "ip link del".

A minor difference still exists between ioctl and rtnl based PPP
interfaces: in the device name, the number following the "ppp" prefix
corresponds to the PPP unit number for ioctl based devices, while it is
just an unrelated incrementing index for rtnl ones.
Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

96d934c7

ppp: define reusable device creation functions · 7d9f0b48

由 Guillaume Nault 提交于 4月 28, 2016

Move PPP device initialisation and registration out of
ppp_create_interface().
This prepares code for device registration with rtnetlink.

While there, simplify the prototype of ppp_create_interface():

  * Since ppp_dev_configure() takes care of setting file->private_data,
    there's no need to return a ppp structure to ppp_unattached_ioctl()
    anymore.

  * The unit parameter is made read/write so that ppp_create_interface()
    can tell which unit number has been assigned.
Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7d9f0b48

net: ethernet: stmmac: update MDIO support for GMAC4 · ac1f74a7

由 Alexandre TORGUE 提交于 4月 28, 2016

On new GMAC4 IP, MAC_MDIO_address register has been updated, and bitmaps
changed. This patch takes into account those changes.
Signed-off-by: NAlexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ac1f74a7

vxlan: fix initialization with custom link parameters · 65226ef8

由 Jiri Benc 提交于 4月 28, 2016

Commit 0c867c9b ("vxlan: move Ethernet initialization to a separate
function") changed initialization order and as an unintended result, when the
user specifies additional link parameters (such as IFLA_ADDRESS) while
creating vxlan interface, those are overwritten by vxlan_ether_setup later.

It's necessary to call ether_setup from withing the ->setup callback. That
way, the correct parameters are set by rtnl_create_link later. This is done
also for VXLAN-GPE, as we don't know the interface type yet at that point,
and changed to the correct interface type later.

Fixes: 0c867c9b ("vxlan: move Ethernet initialization to a separate function")
Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Tested-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65226ef8

Merge branch 'samples-bpf-user-experience' · 638af178

由 David S. Miller 提交于 4月 29, 2016

Jesper Dangaard Brouer says:

====================
samples/bpf: Improve user experience

It is a steep learning curve getting started with using the eBPF
examples in samples/bpf/.  There are several dependencies, and
specific versions of these dependencies.  Invoking make in the correct
manor is also slightly obscure.

This patchset cleanup, document and hopefully improves the first time
user experience with the eBPF samples directory by auto-detecting
certain scenarios.

V4:
 - Address Naveen's nitpicks
 - Handle/fail if extra args are passed in LLC or CLANG (David Laight)

V3:
 - Add Alexei's ACKs
 - Remove README paragraph about LLVM experimental BPF target
   as it only existed between LLVM version 3.6 to 3.7.

V2:
 - Adjusted recommend minimum versions to 3.7.1
 - Included clang build instructions
 - New patch adding CLANG variable and validation of command
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

638af178

samples/bpf: like LLC also verify and allow redefining CLANG command · bdefbbf2

由 Jesper Dangaard Brouer 提交于 4月 28, 2016

Users are likely to manually compile both LLVM 'llc' and 'clang'
tools.  Thus, also allow redefining CLANG and verify command exist.

Makefile implementation wise, the target that verify the command have
been generalized.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bdefbbf2

samples/bpf: allow make to be run from samples/bpf/ directory · b62a796c

由 Jesper Dangaard Brouer 提交于 4月 28, 2016

It is not intuitive that 'make' must be run from the top level
directory with argument "samples/bpf/" to compile these eBPF samples.

Introduce a kbuild make file trick that allow make to be run from the
"samples/bpf/" directory itself.  It basically change to the top level
directory and call "make samples/bpf/" with the "/" slash after the
directory name.

Also add a clean target that only cleans this directory, by taking
advantage of the kbuild external module setting M=$PWD.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b62a796c

samples/bpf: add a README file to get users started · 1c97566d

由 Jesper Dangaard Brouer 提交于 4月 28, 2016

Getting started with using examples in samples/bpf/ is not
straightforward.  There are several dependencies, and specific
versions of these dependencies.

Just compiling the example tool is also slightly obscure, e.g. one
need to call make like:

 make samples/bpf/

Do notice the "/" slash after the directory name.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c97566d

samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported · 7b01dd57

由 Jesper Dangaard Brouer 提交于 4月 28, 2016

Make compiling samples/bpf more user friendly, by detecting if LLVM
compiler tool 'llc' is available, and also detect if the 'bpf' target
is available in this version of LLVM.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b01dd57

samples/bpf: add back functionality to redefine LLC command · 6ccfba75

由 Jesper Dangaard Brouer 提交于 4月 28, 2016

It is practical to be-able-to redefine the location of the LLVM
command 'llc', because not all distros have a LLVM version with bpf
target support.  Thus, it is sometimes required to compile LLVM from
source, and sometimes it is not desired to overwrite the distros
default LLVM version.

This feature was removed with 128d1514 ("samples/bpf: Use llc in
PATH, rather than a hardcoded value").

Add this features back. Note that it is possible to redefine the LLC
on the make command like:

 make samples/bpf/ LLC=~/git/llvm/build/bin/llc

Fixes: 128d1514 ("samples/bpf: Use llc in PATH, rather than a hardcoded value")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ccfba75

Merge branch 'cxgb4-mbox-cmd-logging' · c23846c1

由 David S. Miller 提交于 4月 29, 2016

Hariprasad Shenai says:

====================
cxgb4/cxgb4vf: add support for mbox cmd logging

This patch series adds support for logging mailbox commands and
replies for debugging purpose for both PF and VF driver.

This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly
review the change and let us know in case of any review comments.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c23846c1

cxgb4vf: Add support to enable logging of firmware mailbox commands for VF · ae7b7576

由 Hariprasad Shenai 提交于 4月 28, 2016

Add new /sys/kernel/debug/ support to dump firmware mailbox commands
and replies for debugging purpose.

Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ae7b7576

cxgb4: Add support to enable logging of firmware mailbox commands · 7f080c3f

由 Hariprasad Shenai 提交于 4月 28, 2016

Add new /sys/kernel/debug/ support to dump a firmware mailbox command
issued and replies for debugging purpose.

Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7f080c3f

Merge branch 'hns-props' · 482f13aa

由 David S. Miller 提交于 4月 29, 2016

Yisen Zhuang says:

====================
net: hns: update DT properties according to Rob's comments

There are some inappropriate properties definition in hns DT. We
update the definition according to Rob's review comments and fix some
typos in binding.

For more details, please see individual patches.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

482f13aa

dts: hisi: update hns dst for changing property port-id to reg · ea991027

由 Yisen.Zhuang\(Zhuangyuzeng\) 提交于 4月 28, 2016

Indexes should generally be avoided. This patch changes property port-id
to reg in dsaf port node.
Signed-off-by: NYisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea991027

Documentation: Bindings: Update DT binding for hns dsaf node · a1ecde2c

由 Yisen.Zhuang\(Zhuangyuzeng\) 提交于 4月 28, 2016

This patch changes property port-id to reg in dsaf port node,
removes property cpld-ctrl-reg, and fixes some typos.
Signed-off-by: NYisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1ecde2c

net: hns: change port-id property to reg property in dsaf port node · 0211b8fb

由 Yisen.Zhuang\(Zhuangyuzeng\) 提交于 4月 28, 2016

Indexes should generally be avoided. So we use reg rather than port-id to
index ports.
Signed-off-by: NYisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0211b8fb

net: hns: remove cpld-ctrl-reg and add cell in the cpld-syscon property · 1ffdfac9

由 Yisen.Zhuang\(Zhuangyuzeng\) 提交于 4月 28, 2016

Because cpld-ctrl-reg property is offset base on cpld-syscon property,
we make it as a cell in the cpld-syscon property.
Signed-off-by: NYisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ffdfac9

29 4月, 2016 13 次提交

ipvlan: Fix failure path in dev registration during link creation · 494e8489

由 Mahesh Bandewar 提交于 4月 27, 2016

When newlink creation fails at device-registration, the port->count
is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found
this issue in Macvlan and the same exists in IPvlan driver too.

While fixing this issue I noticed another issue of missing unregister
in case of failure, so adding it to the fix which is similar to the
macvlan fix by Francesco in commit 30837960 ("macvlan: fix failure
during registration v3")
Reported-by: NFrancesco Ruggeri <fruggeri@arista.com>
Signed-off-by: NMahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

494e8489

pch_gbe: replace private tx ring lock with common netif_tx_lock · 222e4d0b

由 françois romieu 提交于 4月 27, 2016

pch_gbe_tx_ring.tx_lock is only used in the hard_xmit handler and
in the transmit completion reaper called from NAPI context.

Compile-tested only. Potential victims Cced.

Someone more knowledgeable may check if pch_gbe_tx_queue could
have some use for a mmiowb.
Signed-off-by: NFrancois Romieu <romieu@fr.zoreil.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Cress <andy.cress@us.kontron.com>
Cc: bryan@fossetcon.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

222e4d0b

net: dsa: Provide CPU port statistics to master netdev · badf3ada

由 Florian Fainelli 提交于 4月 27, 2016

This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).

We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.

We take this approach as opposed to providing a set of DSA helper
functions that would retrive the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.

The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

badf3ada

tcp: give prequeue mode some care · 0cef6a4c

由 Eric Dumazet 提交于 4月 27, 2016

TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.

Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.

Current prequeue implementation has following issues :

1) It only checks size of the prequeue against sk_rcvbuf

   It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
   But we now have ~8MB values to cope with modern networking needs.
   We have to add sk_rmem_alloc in the equation, since out of order
   packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, prequeue can be filled
   by thousands of packets. When prequeue needs to be flushed, either
   from sofirq context (in tcp_prequeue() or timer code), or process
   context (in tcp_prequeue_process()), this adds a latency spike
   which is often not desirable.
   I added a fixed limit of 32 packets, as this translated to a max
   flush time of 60 us on my test hosts.

   Also note that all packets in prequeue are not accounted for tcp_mem,
   since they are not charged against sk_forward_alloc at this point.
   This is probably not a big deal.

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0cef6a4c

fq: split out backlog update logic · b43e7199

由 Michal Kazior 提交于 4月 27, 2016

mac80211 (which will be the first user of the
fq.h) recently started to support software A-MSDU
aggregation. It glues skbuffs together into a
single one so the backlog accounting needs to be
more fine-grained.

To avoid backlog sorting logic duplication split
it up for re-use.
Signed-off-by: NMichal Kazior <michal.kazior@tieto.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b43e7199

tipc: remove an unnecessary NULL check · b4358657

由 Dan Carpenter 提交于 4月 27, 2016

This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4358657

net/mlx5e: avoid stack overflow in mlx5e_open_channels · 6b87663f

由 Arnd Bergmann 提交于 4月 26, 2016

struct mlx5e_channel_param is a large structure that is allocated
on the stack of mlx5e_open_channels, and with a recent change
it has grown beyond the warning size for the maximum stack
that a single function should use:

mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels':
mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

The function is already using dynamic allocation and is not in
a fast path, so the easiest workaround is to use another kzalloc
for allocating the channel parameters.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Fixes: d3c9bc27 ("net/mlx5e: Added ICO SQs")
Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b87663f

tuntap: calculate rps hash only when needed · 3df97ba8

由 Jason Wang 提交于 4月 25, 2016

There's no need to calculate rps hash if it was not enabled. So this
patch export rps_needed and check it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) shows this can
improve pps about 13% (when rps is disabled).

Before:
~1150000 pps
After:
~1300000 pps

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3df97ba8

Merge branch 'tcp-eor' · f345c9a5

由 David S. Miller 提交于 4月 28, 2016

Martin KaFai Lau says:

====================
tcp: Make use of MSG_EOR in tcp_sendmsg

v4:
~ Do not set eor bit in do_tcp_sendpages() since there is
  no way to pass MSG_EOR from the userland now.
~ Avoid rmw by testing MSG_EOR first in tcp_sendmsg().
~ Move TCP_SKB_CB(skb)->eor test to a new helper
  tcp_skb_can_collapse_to() (suggested by Soheil).
~ Add some packetdrill tests.

v3:
~ Separate EOR marking from the SKBTX_ANY_TSTAMP logic.
~ Move the eor bit test back to the loop in tcp_sendmsg and
  tcp_sendpage because there could be >1 threads doing
  sendmsg.
~ Thanks to Eric Dumazet's suggestions on v2.
~ The TCP timestamp bug fixes are separated into other threads.

v2:
~ Rework based on the recent work
  "add TX timestamping via cmsg" by
  Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
~ This version takes the MSG_EOR bit as a signal of
  end-of-response-message and leave the selective
  timestamping job to the cmsg
~ Changes based on the v1 feedback (like avoid
  unlikely check in a loop and adding tcp_sendpage
  support)
~ The first 3 patches are bug fixes.  The fixes in this
  series depend on the newly introduced txstamp_ack in
  net-next.  I will make relevant patches against net after
  getting some feedback.
~ The test results are based on the recently posted net fix:
  "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f345c9a5

tcp: Handle eor bit when fragmenting a skb · a166140e

由 Martin KaFai Lau 提交于 4月 25, 2016

When fragmenting a skb, the next_skb should carry
the eor from prev_skb.  The eor of prev_skb should
also be reset.

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a166140e

tcp: Handle eor bit when coalescing skb · a643b5d4

由 Martin KaFai Lau 提交于 4月 25, 2016

This patch:
1. Prevent next_skb from coalescing to the prev_skb if
   TCP_SKB_CB(prev_skb)->eor is set
2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
   allowed

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a643b5d4

tcp: Make use of MSG_EOR in tcp_sendmsg · c134ecb8

由 Martin KaFai Lau 提交于 4月 25, 2016

This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg.  The eor bit
will prevent data from appending to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case.  The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: NEric Dumazet <edumazet@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c134ecb8

Merge branch 'tcp-redundant-checks' · 2a9e8438

由 David S. Miller 提交于 4月 28, 2016

Soheil Hassas Yeganeh says:

====================
tcp: simplify ack tx timestamps

v2:
- Fully remove SKBTX_ACK_TSTAMP, as suggested by Willem de Bruijn.

This patch series aims at removing redundant checks and fields
for ack timestamps for TCP.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2a9e8438

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功