Commits · 0aea76d35c9651d55bbaf746e7914e5f9ae5a25d · openeuler / raspberrypi-kernel

26 Apr, 2016 13 commits

tcp: SYN packets are now simply consumed · 0aea76d3

Committed by Eric Dumazet on Apr 21, 2016

We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.

We replace kfree_skb() with consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.

[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0aea76d3

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 1bc7fe64

Committed by David S. Miller on Apr 25, 2016

Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2016-04-25

This series contains updates to ixgbe and ixgbevf.

Emil provides several patches, starting with the consolidation of the
logic behind configuring spoof checking.  Fixed an issue which was
causing link issues for backplane devices because x550em_a/x devices
did not have a default value for mac->ops.setup_link.  Refactored the
ethtool stats to bring the logic closer to how ixgbe handles stats and
sets up per-queue stats for ixgbevf.

Mark adds a new register to wait for previous register writes to complete
before issuing a register read, which is needed when slower links are
in use.  Fixed the flow control setup for x550em_a; the incorrect
fc_setup function was being used.

Don added a workaround for empty SFP+ cage crosstalk, since on some
systems the crosstalk could lead to link flap on empty SFP+ cages.

Jake converts ixgbe and ixgbevf to use the BIT() macro.

Alex Duyck adds support for partial GSO segmentation in the case of
tunnels for ixgbe and ixgbevf.  Then preps for HyperV by moving the API
negotiation into mac_ops.

Arnd Bergmann provides a fix for the ARM compile warnings in linux-next
by converting the use of a udelay() to msleep().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1bc7fe64

Merge branch 'nla_align-set-2' · e7157f28

Committed by David S. Miller on Apr 25, 2016

Nicolas Dichtel says:

====================
netlink: align attributes when needed (patchset #2)

This is the continuation (series #2) of the work done to align netlink
attributes when these attributes contain some 64-bit fields.

In patch #3, I didn't modify the function ila_encap_nlsize(). I was waiting
for feedback on this patch: http://patchwork.ozlabs.org/patch/613766/
If it's approved, there will be an update to switch nla_total_size() to
nla_total_size_64bit() after the merge of net in net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e7157f28

wireless: use nla_put_u64_64bit() · 2dad624e

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2dad624e

netfilter/ipvs: use nla_put_u64_64bit() · cbdeafd7

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbdeafd7

ieee802154: use nla_put_u64_64bit() · a558da09

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a558da09

l2tp: use nla_put_u64_64bit() · 1c714a92

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c714a92

bridge: use nla_put_u64_64bit() · 12a0faa3

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12a0faa3

ovs: use nla_put_u64_64bit() · 0238b720

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0238b720

ipv6: use nla_put_u64_64bit() · f13a82d8

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f13a82d8

sched: use nla_put_u64_64bit() · 2a51c1e8

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a51c1e8

rtnl: use nla_put_u64_64bit() · 343a6d8e

Committed by Nicolas Dichtel on Apr 25, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

343a6d8e

soreuseport: Resolve merge conflict for v4/v6 ordering fix · d296ba60

Committed by Craig Gallek on Apr 25, 2016

d894ba18 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d296ba60

25 Apr, 2016 22 commits

sock: relax WARN_ON() in sock_owned_by_user() · 5e91f6ce

Committed by Eric Dumazet on Apr 25, 2016

Valdis reported tons of stack dumps caused by WARN_ON() in
sock_owned_by_user()

This test needs to be relaxed if/when lockdep disables itself.

Note that other lockdep_sock_is_held() callers are all from
rcu_dereference_protected() sections which already are disabled
if/when lockdep has been disabled.

Fixes: fafc4e1e ("sock: tigthen lockdep checks for sock_owned_by_user")
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e91f6ce

ixgbe: use msleep for long delays · d4f90d9d

Committed by Arnd Bergmann on Apr 16, 2016

The newly added x550em_a support causes a link failure on ARM because of
an overly long time passed into udelay():

ERROR: "__bad_udelay" [drivers/net/ethernet/intel/ixgbe/ixgbe.ko] undefined!

There are multiple variants of the ixgbe_acquire_swfw_sync_*() function,
and the other ones all use msleep(), so we can safely assume that all
callers are allowed to sleep, which makes msleep() a better replacement
than mdelay().
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 49425dfc ("ixgbe: Add support for x550em_a 10G MAC type")
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d4f90d9d

ixgbevf: Move API negotiation function into mac_ops · 7921f4dc

Committed by Alexander Duyck on Apr 14, 2016

This patch moves API negotiation into mac_ops. The general idea is that,
with HyperV on the way, anything that will differ between HyperV and a
standard VF needs to be abstracted enough that each can have its own
function, so that changes in one do not break the other.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

7921f4dc

ixgbe/ixgbevf: Add support for GSO partial · b83e3010

Committed by Alexander Duyck on Apr 14, 2016

This patch adds support for partial GSO segmentation in the case of
tunnels. Specifically, with this change the driver can perform segmentation
as long as the frame either has IPv6 inner headers, or we are allowed to
mangle the IP IDs on the inner header. This is needed because we will not
be modifying any fields from the start of the outer transport header to
the start of the inner transport header, as we are treating them like they
are just a block of IP options.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

b83e3010

ixgbevf: make use of BIT() macro to avoid shift of signed values · 8d055cc0

Committed by Jacob Keller on Apr 13, 2016

Also clean up a case where we're bit shifting a value into place, and use
an unsigned constant. Make use of the unsigned suffix in places where the
BIT() macro is not appropriate.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

8d055cc0

ixgbe: resolve shift of negative value warning · 3e973dc4

Committed by Jacob Keller on Apr 13, 2016

Make use of GENMASK instead of open coding the equivalent operation
incorrectly.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

3e973dc4
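The commit message above does not show the expression that was replaced, so the following is only a minimal Python model of what a GENMASK(h, l) style helper computes: a mask with bits l through h (inclusive) set. The name `genmask` and the `BITS_PER_LONG = 64` value are assumptions made for illustration.

```python
BITS_PER_LONG = 64  # assumption: models a 64-bit build


def genmask(high, low):
    """Return an integer with bits low..high (inclusive) set."""
    if not 0 <= low <= high < BITS_PER_LONG:
        raise ValueError("expected 0 <= low <= high < BITS_PER_LONG")
    all_ones = (1 << BITS_PER_LONG) - 1
    # Clear everything below `low`, then clear everything above `high`.
    return (all_ones << low) & (all_ones >> (BITS_PER_LONG - 1 - high))


if __name__ == "__main__":
    print(hex(genmask(7, 4)))
    print(hex(genmask(31, 0)))
```

Building the mask from an all-ones unsigned value avoids shifting a negative quantity, which is the class of warning the commit title refers to.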

ixgbe: use BIT() macro · b4f47a48

Committed by Jacob Keller on Apr 13, 2016

Several areas of ixgbe were written before widespread usage of the
BIT(n) macro. With the impending release of GCC 6 and its associated new
warnings, some usages such as (1 << 31) have been noted within the ixgbe
driver source. Fix these wholesale and prevent future issues by simply
using BIT macro instead of hand coded bit shifts.

Also fix a few shifts that are shifting values into place by using the
'u' suffix to indicate unsigned. It doesn't strictly matter in these
cases because we're not shifting by too large a value, but these are all
unsigned values and should be indicated as such.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

b4f47a48
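To illustrate why (1 << 31) is a problem with a signed 32-bit int, here is a small Python sketch. It assumes two's-complement reinterpretation, which is what commonly happens in practice; in C the signed shift itself is undefined behavior. The helper names `bit` and `as_int32` are hypothetical and exist only for this illustration.

```python
import ctypes


def bit(n):
    """Model of a BIT(n) style macro: an unsigned value with only bit n set."""
    return 1 << n


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value


if __name__ == "__main__":
    # Bit 31 is the sign bit of a 32-bit int, so the signed view is negative.
    print(as_int32(1 << 31))
    # Treated as unsigned, the same bit pattern is simply 0x80000000.
    print(hex(bit(31)))
```

Using an unsigned constant for the shifted value keeps bit 31 from being interpreted as a sign bit.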

ixgbe: Add work around for empty SFP+ cage crosstalk · 4319a797

Committed by Don Skidmore on Apr 12, 2016

It is possible on some systems that crosstalk could lead to link flap
on empty SFP+ cages.  A new NVM bit was defined to let SW know it
needs to implement the work around which consists of verifying that
there is a module in the cage before acting on the LSC.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

4319a797

ixgbe: Use correct FC setup function for x550em_a · a0254a70

Committed by Mark Rustad on Apr 08, 2016

Somehow the wrong fc_setup function was used for x550em_a, so
correct that. Also set setup_link to NULL as its value is
determined later, just like it is with X550EM_x.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a0254a70

ixgbevf: add support for per-queue ethtool stats · a02a5a53

Committed by Emil Tantilov on Apr 07, 2016

Implement per-queue statistics for packets, bytes and busy poll
specific counters.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a02a5a53

ixgbevf: refactor ethtool stats handling · d72d6c19

Committed by Emil Tantilov on Apr 07, 2016

This brings the logic closer to how we handle the stats in ixgbe and it
sets us up for introducing per-queue stats.

Use IXGBEVF_STAT and IXGBEVF_NETDEV_STAT for accessing the driver and
netdev stats respectively. This way we don't have to calculate the
stats based on register values which could lead to the counters not
being initialized properly when the interface is down.

IXGBEVF_QUEUE_STATS_LEN is set to include the number of queues.

Also some defines were renamed to use the IXGBEVF prefix.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d72d6c19

ixgbe: Add register wait for slow links · 2f2219be

Committed by Mark Rustad on Apr 07, 2016

Use a new register to wait for previous register writes to complete
before issuing a register read. This is needed when slower links
are in use.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2f2219be

hv_netvsc: Fix the list processing for network change event · 15cfd407

Committed by Haiyang Zhang on Apr 21, 2016

RNDIS_STATUS_NETWORK_CHANGE event is handled as two "half events" --
media disconnect & connect. The second half should be added to the list
head, not to the tail, so that all events are processed in normal order.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15cfd407
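The ordering issue described in the commit above can be sketched with a plain double-ended queue. This is a simplified model and not the driver's code: the event names and the `process` function are hypothetical, and the only assumption carried over from the commit message is that a network-change event is split into a disconnect half handled immediately and a connect half that is queued again.

```python
from collections import deque


def process(events, requeue_at_head):
    """Drain a queue of events, splitting NETWORK_CHANGE into two halves."""
    queue = deque(events)
    handled = []
    while queue:
        event = queue.popleft()
        if event == "NETWORK_CHANGE":
            # First half: handled right away as a media disconnect.
            handled.append("disconnect")
            # Second half: queued again, either at the head or at the tail.
            if requeue_at_head:
                queue.appendleft("connect")
            else:
                queue.append("connect")
        else:
            handled.append(event)
    return handled


if __name__ == "__main__":
    pending = ["NETWORK_CHANGE", "later_event"]
    print(process(pending, requeue_at_head=False))
    print(process(pending, requeue_at_head=True))
```

With tail insertion, an event that was already queued behind the network change is handled between the two halves; head insertion keeps the halves adjacent.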

ixgbe: make 'action' field in struct ixgbe_fdir_filter a u64 value · 2a9ed5d1

Committed by Sridhar Samudrala on Apr 01, 2016

This field is used to record the RX queue index for a redirect action
passed via ring_cookie field in struct ethtool_rx_flow_spec which is
a u64 value.

For example, after adding a filter rule to redirect to a VF using ethtool
  # echo 4 > /sys/class/net/p4p1/device/sriov_numvfs
  # ethtool -N p4p1 flow-type ip4 src-ip 192.168.0.1 action 0x100000000

querying for the rule shows the Action as 'Direct to queue 0'

  # ethtool -n p4p1
  4 RX rings available
  Total 1 rules

  Filter: 2045
	Rule Type: Raw IPv4
	Src IP addr: 192.168.0.1 mask: 0.0.0.0
	Dest IP addr: 0.0.0.0 mask: 255.255.255.255
	TOS: 0x0 mask: 0xff
	Protocol: 0 mask: 0xff
	L4 bytes: 0x0 mask: 0xffffffff
	VLAN EtherType: 0x0 mask: 0xffff
	VLAN: 0x0 mask: 0xffff
	User-defined: 0x0 mask: 0xffffffffffffffff
	Action: Direct to queue 0

With this fix, ethtool will report the right queue index even for VFs.
	Action: Direct to queue 4294967296

Here 4294967296 corresponds to 0x100000000.
We need to update 'ethtool' to report the queue index as a hex value so
that it is more user friendly and matches the 'action' value that
is passed when adding the rule.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2a9ed5d1
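The numbers in the commit above can be checked with a short Python sketch. The mask values are assumed to mirror the ETHTOOL_RX_FLOW_SPEC_RING and ETHTOOL_RX_FLOW_SPEC_RING_VF definitions in the ethtool UAPI header (queue index in the low 32 bits, VF field in the next 8 bits). The function names are hypothetical, and the 32-bit width used for the truncation demo is an assumption, since the commit message does not state the previous width of the 'action' field.

```python
RING_MASK = 0x00000000FFFFFFFF     # assumed: ETHTOOL_RX_FLOW_SPEC_RING
RING_VF_MASK = 0x000000FF00000000  # assumed: ETHTOOL_RX_FLOW_SPEC_RING_VF
RING_VF_OFF = 32                   # assumed: ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF


def split_ring_cookie(cookie):
    """Return (vf_field, queue_index) extracted from a 64-bit ring_cookie."""
    vf_field = (cookie & RING_VF_MASK) >> RING_VF_OFF
    queue_index = cookie & RING_MASK
    return vf_field, queue_index


def truncate(value, bits):
    """Model storing value in an unsigned field that is `bits` wide."""
    return value & ((1 << bits) - 1)


if __name__ == "__main__":
    cookie = 0x100000000  # the 'action' value passed to ethtool -N above
    print(cookie)                     # decimal form reported after the fix
    print(split_ring_cookie(cookie))  # VF field and queue index
    print(truncate(cookie, 32))       # what a narrower field would retain
```

Any field too narrow to hold bit 32 loses the VF part of the cookie, which is consistent with the 'Direct to queue 0' output shown before the fix.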

ixgbe: fix default mac->ops.setup_link for X550EM · 4695886c

Committed by Emil Tantilov on Mar 24, 2016

X550EM_a/x did not have a default value for mac->ops.setup_link which
was causing link issues for backplane devices.

This patch sets mac->ops.setup_link to ixgbe_setup_mac_link_X540 for
X550EM_a/x which is also default for X550. This will result in
mac->ops.setup_link calling the link setup function for the respective
PHY type in case we do not need a special function to deal with it.
Reported-by: Ken Cox <jkc@redhat.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

4695886c

ixgbe: set VLAN spoof checking unconditionally · d3dec7c7

Committed by Emil Tantilov on Mar 18, 2016

Previously the PF driver would only set VLAN spoof checking if
the VF had created VLANs. This was done by setting and checking
a counter (vlan_count) whenever a VLAN was created by the VF.
However it is possible for the vlan_count to be !=0 while there are
no VLANs assigned to the VF due to the count incrementing every
time a VLAN 0 is added on ifdown/up, which resulted in VLAN spoofing
always being set for those VFs.

This patch cleans up the logic by unconditionally setting VLAN spoof
checking based on how the VF is configured (via ip link set ethX vf Y
spoofchk on/off).
This change also resolves an issue where the VLAN spoofing can remain
set even after being disabled by the user due to the driver enabling
VLAN spoof checking every time a VLAN is added to the VF, but would
only allow changes in the setting if vlan_count != 0.

Also default_vf_vlan_id and vlans_enabled were removed from the
vf_data_storage structure since they are not being used in the driver.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d3dec7c7

ixgbe: consolidate the configuration of spoof checking · 77f192af

Committed by Emil Tantilov on Mar 18, 2016

Consolidate the logic behind configuring spoof checking:

Move the setting of the MAC, VLAN and Ethertype spoof checking into
ixgbe_ndo_set_vf_spoofchk().

Change ixgbe_set_mac_anti_spoofing() to set MAC spoofing per VF similar
to the VLAN and Ethertype functions - this allows us to call the helper
functions in ixgbe_ndo_set_vf_spoofchk() for all spoof check types and
only disable MAC spoof checking when creating MACVLAN.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

77f192af

tcp-tso: do not split TSO packets at retransmit time · 10d3be56

Committed by Eric Dumazet on Apr 21, 2016

Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as a lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1% packet loss is common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to trains of 1-MSS packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10d3be56
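The "~80,000 losses per second" figure in the commit above can be reproduced with a back-of-the-envelope calculation. The 1500-byte packet size is an assumption (the commit message does not state one), and the function name is hypothetical.

```python
def losses_per_second(link_bps, packet_bytes, loss_rate):
    """Estimate lost packets per second on a fully loaded link."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return packets_per_second * loss_rate


if __name__ == "__main__":
    # 100 Gbit/s link, 1500-byte packets (assumed), 1% loss.
    estimate = losses_per_second(100e9, 1500, 0.01)
    print(round(estimate))
```

With these inputs the link carries about 8.3 million packets per second, so a 1% loss rate lands in the low 80,000s per second, in line with the commit's estimate.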

tipc: fix stale links after re-enabling bearer · 8cee83dd

Committed by Parthasarathy Bhuvaragan on Apr 21, 2016

Commit 42b18f60 ("tipc: refactor function tipc_link_timeout()"),
introduced a bug which prevents sending of probe messages during
link synchronization phase. This leads to hanging links, if the
bearer is disabled/enabled after links are up.

In this commit, we send the probe messages correctly.

Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8cee83dd

Merge branch 'tcp-tcstamp_ack-frag-coalesce' · 6a74c196

Committed by David S. Miller on Apr 24, 2016

Martin KaFai Lau says:

====================
tcp: Handle txstamp_ack when fragmenting/coalescing skbs

This patchset is to handle the txstamp-ack bit when
fragmenting/coalescing skbs.

The second patch depends on the recently posted series
for the net branch:
"tcp: Merge timestamp info when coalescing skbs"

A BPF prog is used to kprobe sock_queue_err_skb()
and print out the value of serr->ee.ee_data.  The BPF
prog (runnable from bcc) is attached here:

BPF prog used for testing:
~~~~~

from __future__ import print_function
from bcc import BPF

bpf_text = """

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6a74c196

tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp · 2de8023e

Committed by Martin KaFai Lau on Apr 19, 2016

When collapsing skbs, txstamp_ack also needs to be merged.

Retrans Collapse Test:
~~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2027  [007] d.s.    79.765921: : ee_data:1459

Sacks Collapse Test:
~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2049  [007] d.s.    89.185538: : ee_data:14599
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2de8023e

tcp: Carry txstamp_ack in tcp_fragment_tstamp · b51e13fa

Committed by Martin KaFai Lau on Apr 19, 2016

When a tcp skb is sliced into two smaller skbs (e.g. in
tcp_fragment() and tso_fragment()),  it does not carry
the txstamp_ack bit to the newly created skb if it is needed.
The end result is a timestamping event (SCM_TSTAMP_ACK) will
be missing from the sk->sk_error_queue.

This patch carries this bit to the new skb2
in tcp_fragment_tstamp().

BPF Output Before:
~~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~~
<...>-2050  [000] d.s.   100.928763: : ee_data:14599

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 14600) = 14600
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257

0.300 close(4) = 0
0.300 > F. 14601:14601(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 14602:14602(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b51e13fa
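The packetdrill scripts in the two commits above pass raw integers to setsockopt(). Option 37 is assumed here to be SO_TIMESTAMPING (its value in the generic socket header), and the values 2688 and 2176 can then be decoded against the SOF_TIMESTAMPING_* bit positions. The table below is reproduced from the kernel UAPI header as an assumption; the `decode` helper is hypothetical.

```python
# Assumed bit positions of the SOF_TIMESTAMPING_* flags (UAPI net_tstamp.h).
SOF_TIMESTAMPING_FLAGS = {
    1 << 0: "SOF_TIMESTAMPING_TX_HARDWARE",
    1 << 1: "SOF_TIMESTAMPING_TX_SOFTWARE",
    1 << 2: "SOF_TIMESTAMPING_RX_HARDWARE",
    1 << 3: "SOF_TIMESTAMPING_RX_SOFTWARE",
    1 << 4: "SOF_TIMESTAMPING_SOFTWARE",
    1 << 5: "SOF_TIMESTAMPING_SYS_HARDWARE",
    1 << 6: "SOF_TIMESTAMPING_RAW_HARDWARE",
    1 << 7: "SOF_TIMESTAMPING_OPT_ID",
    1 << 8: "SOF_TIMESTAMPING_TX_SCHED",
    1 << 9: "SOF_TIMESTAMPING_TX_ACK",
    1 << 10: "SOF_TIMESTAMPING_OPT_CMSG",
    1 << 11: "SOF_TIMESTAMPING_OPT_TSONLY",
}


def decode(value):
    """Return the names of the flags set in value, lowest bit first."""
    return [name for flag, name in sorted(SOF_TIMESTAMPING_FLAGS.items())
            if value & flag]


if __name__ == "__main__":
    for raw in (2688, 2176):
        print(raw, decode(raw))
```

Under these assumptions, 2688 requests an ACK timestamp (TX_ACK together with OPT_ID and OPT_TSONLY), while 2176 keeps only OPT_ID and OPT_TSONLY, so the scripts appear to toggle the ACK timestamp request around specific writes.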

24 Apr, 2016 5 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 11afbff8

Committed by David S. Miller on Apr 24, 2016

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:

1) Ensure we don't go over the ruleset blob boundaries in
   mark_source_chains().

2) Validate that target jumps land on an existing xt_entry. This extra
   sanitization comes with a performance penalty when loading the ruleset.

3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.

4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.

5) Enforce the minimal possible target size in x_tables.

6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.

7) Check that standard target size is valid.

8) More sanitization to ensure that the target_offset field is correct.

9) Add xt_check_entry_match() to validate that matches are well-formed.

10-12) Three patches to reduce the number of parameters in
    translate_compat_table() for {arp,ip,ip6}tables by using a container
    structure.

13) No need to return value from xt_compat_match_from_user(), so make
    it void.

14) Consolidate translate_table() so it can be used by compat code too.

15) Remove obsolete check for compat code, so we keep consistent with
    what was already removed in the native layout code (back in 2007).

16) Get rid of target jump validation from mark_source_chains(),
    obsoleted by #2.

17) Introduce xt_copy_counters_from_user() to consolidate counter
    copying, and use it from {arp,ip,ip6}tables.

18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
    functions.

19) Move nf_connlabel_match() to xt_connlabel.

20) Skip event notification if connlabel did not change.

21) Update of nf_connlabels_get() to make the upcoming nft connlabel
    support easier.

23) Remove spinlock to read protocol state field in conntrack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

11afbff8

Merge branch 'nla_align-more' · 8d9ea160

Committed by David S. Miller on Apr 23, 2016

Nicolas Dichtel says:

====================
netlink: align attributes when needed (patchset #1)

This is the continuation of the work done to align netlink attributes
when these attributes contain some 64-bit fields.

David, if the third patch is too big (or maybe the series), I can split it.
Just tell me what you prefer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8d9ea160

taskstats: use the libnl API to align nlattr on 64-bit · 80df5542

Committed by Nicolas Dichtel on Apr 22, 2016

The goal of this patch is to use the new libnl API to align netlink
attributes when needed.
The layout of the netlink message will be a bit different after the patch,
because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
attribute instead of before it.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80df5542

xfrm: align nlattr properly when needed · de95c4a4

Committed by Nicolas Dichtel on Apr 22, 2016

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

de95c4a4

libnl: add nla_put_u64_64bit() helper · 73520786

Committed by Nicolas Dichtel on Apr 22, 2016

With this function, nla_data() is aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

73520786
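The alignment idea behind the helper above can be modeled with a few lines of offset arithmetic. This is a sketch under stated assumptions and not the kernel implementation: it assumes a 4-byte attribute header, a message buffer that starts on an 8-byte boundary, a tail offset that is already 4-byte aligned, and a pad that takes the form of one zero-length attribute (header only). The function names are hypothetical.

```python
NLA_HDRLEN = 4  # assumed size of the attribute header (length + type)
ALIGN_64 = 8    # target alignment for the 64-bit payload


def needs_pad(tail_offset):
    """True if a payload placed right after the next header is misaligned."""
    return (tail_offset + NLA_HDRLEN) % ALIGN_64 != 0


def put_u64_64bit(tail_offset):
    """Return (pad_bytes, payload_offset, new_tail) for one u64 attribute."""
    if tail_offset % 4 != 0:
        raise ValueError("tail_offset is assumed to be 4-byte aligned")
    # A zero-length pad attribute costs exactly one header.
    pad_bytes = NLA_HDRLEN if needs_pad(tail_offset) else 0
    payload_offset = tail_offset + pad_bytes + NLA_HDRLEN
    return pad_bytes, payload_offset, payload_offset + 8


if __name__ == "__main__":
    for tail in (0, 4, 16, 20):
        print(tail, put_u64_64bit(tail))
```

In this model the pad is only emitted when the tail offset is itself 8-byte aligned, because that is the case in which the 4-byte header would push the payload off a 64-bit boundary.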