提交 · 9115e8cd2a0c6eaaa900c462721f12e1d45f326c · openeuler / Kernel

06 12月, 2016 13 次提交

net: reorganize struct sock for better data locality · 9115e8cd

由 Eric Dumazet 提交于 12月 03, 2016

Group fields used in TX path, and keep some cache lines mostly read
to permit sharing among cpus.

Gained two 4 bytes holes on 64bit arches.

Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
to speed up tcp_wfree() in the following patch.

I have not added ____cacheline_aligned_in_smp, this might be done later.
I prefer doing this once inet and tcp/udp sockets reorg is also done.

Tested with both TCP and UDP.

UDP receiver performance under flood increased by ~20 % :
Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
was moved away from a critical cache line, now mostly read and shared.

	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned int               sk_napi_id;           /* 0x100   0x4 */
	int                        sk_rcvbuf;            /* 0x104   0x4 */
	struct sk_filter *         sk_filter;            /* 0x108   0x8 */
	union {
		struct socket_wq * sk_wq;                /*         0x8 */
		struct socket_wq * sk_wq_raw;            /*         0x8 */
	};                                               /* 0x110   0x8 */
	struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
	struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
	struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
	atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
	int                        sk_sndbuf;            /* 0x13c   0x4 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	int                        sk_wmem_queued;       /* 0x140   0x4 */
	atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
	long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
	struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
	struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
	__s32                      sk_peek_off;          /* 0x170   0x4 */
	int                        sk_write_pending;     /* 0x174   0x4 */
	long int                   sk_sndtimeo;          /* 0x178   0x8 */
Signed-off-by: NEric Dumazet <edumazet@google.com>
Tested-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9115e8cd

tcp: tcp_mtu_probe() is likely to exit early · 12a59abc

由 Eric Dumazet 提交于 12月 03, 2016

Adding a likely() in tcp_mtu_probe() moves its code which used to
be inlined in front of tcp_write_xmit()

We still have a cache line miss to access icsk->icsk_mtup.enabled,
we will probably have to reorganize fields to help data locality.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

12a59abc

tcp: tsq: add a shortcut in tcp_small_queue_check() · 75eefc6c

由 Eric Dumazet 提交于 12月 03, 2016

Always allow the two first skbs in write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75eefc6c

tcp: tsq: avoid one atomic in tcp_wfree() · a9b204d1

由 Eric Dumazet 提交于 12月 03, 2016

Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We can schedule it only if our per cpu list was empty.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a9b204d1

tcp: tsq: add shortcut in tcp_tasklet_func() · b223feb9

由 Eric Dumazet 提交于 12月 03, 2016

Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
for other cpus/threads entering tcp_write_xmit() to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give to ACK processing an increased
budget to reduce even more tcp_tasklet_func() amount of work.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b223feb9

tcp: tsq: remove one locked operation in tcp_wfree() · 408f0a6c

由 Eric Dumazet 提交于 12月 03, 2016

Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED
bits, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

408f0a6c

tcp: tsq: add tsq_flags / tsq_enum · 40fc3423

由 Eric Dumazet 提交于 12月 03, 2016

This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40fc3423

Merge branch 'bnxt_en-dcbnl' · f83e8303

由 David S. Miller 提交于 12月 05, 2016

Michael Chan says:

====================
bnxt_en: Add DCBNL support.

This series adds DCBNL operations to support host-based IEEE DCBX.

v2: Updated to the latest firmware interface spec.

David, please consider this series for net-next.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f83e8303

bnxt_en: Add PFC statistics. · c77192f2

由 Michael Chan 提交于 12月 02, 2016

Report PFC statistics to ethtool -S and DCBNL.
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c77192f2

bnxt_en: Implement DCBNL to support host-based DCBX. · 7df4ae9f

由 Michael Chan 提交于 12月 02, 2016

Support only IEEE DCBX initially.  Add IEEE DCBNL ops and functions to
get and set the hardware DCBX parameters.  The DCB code is conditional on
Kconfig CONFIG_BNXT_DCB.
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7df4ae9f

bnxt_en: Update firmware header file to latest 1.6.0. · 87c374de

由 Michael Chan 提交于 12月 02, 2016

Latest interface has the latest DCB command structs.  Get and store the
max number of lossless TCs the hardware can support.
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

87c374de

bnxt_en: Re-factor bnxt_setup_tc(). · c5e3deb8

由 Michael Chan 提交于 12月 02, 2016

Add a new function bnxt_setup_mq_tc() to handle MQPRIO. This new function
will be called during ETS setup when we add DCBNL in the next patch.
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c5e3deb8

net: phy: dp83848: Support ethernet pause frames · c7a61319

由 Jesper Nilsson 提交于 12月 02, 2016

According to the documentation, the PHYs supported by this driver
can also support pause frames. Announce this to be so.
Tested with a TI83822I.
Acked-by: NAndrew F. Davis <afd@ti.com>
Signed-off-by: NJesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7a61319

04 12月, 2016 27 次提交

ipv6 addrconf: Implemented enhanced DAD (RFC7527) · adc176c5

由 Erik Nordmark 提交于 12月 02, 2016

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.
Signed-off-by: NErik Nordmark <nordmark@arista.com>
Signed-off-by: NBob Gilligan <gilligan@arista.com>
Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

adc176c5

Merge branch 'mv88e6390-batch-three' · ce84c7c6

由 David S. Miller 提交于 12月 03, 2016

Andrew Lunn says:

====================
mv88e6390 batch 3

More patches to support the MV88e6390. This is mostly refactoring
existing code and adding implementations for the mv88e6390.  This
patchset set which reserved frames are sent to the cpu, the size of
jumbo frames that will be accepted, turn off egress rate limiting, and
configuration of pause frames.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce84c7c6

net: dsa: mv88e6xxx: Implement mv88e6390 pause control · 3ce0e65e

由 Andrew Lunn 提交于 12月 03, 2016

The mv88e6390 has a number flow control registers accessed via the
Flow Control register. Use these to set the pause control.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3ce0e65e

net: dsa: mv88e6xxx: Refactor pause configuration · b35d322a

由 Andrew Lunn 提交于 12月 03, 2016

The mv88e6390 has a different mechanism for configuring pause.
Refactor the code into an ops function, and for the moment, don't add
any mv88e6390 code yet.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b35d322a

net: dsa: mv88e6xxx: Refactor egress rate limiting · ef70b111

由 Andrew Lunn 提交于 12月 03, 2016

There are two different rate limiting configurations, depending on the
switch generation. Refactor this into ops.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef70b111

net: dsa: mv88e6xxx: Refactor setting of jumbo frames · 5f436666

由 Andrew Lunn 提交于 12月 03, 2016

Some switches support jumbo frames. Refactor this code into operations
in the ops structure.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5f436666

net: dsa: mv88e6xxx: Reserved Management frames to CPU · 6e55f698

由 Andrew Lunn 提交于 12月 03, 2016

Older devices have a couple of registers in global2. The mv88e6390
family has a single register in global1 behind which hides similar
configuration. Implement and op for this.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e55f698

Merge branch 'mv88e6390-batch-two' · 7a6c5cb9

由 David S. Miller 提交于 12月 03, 2016

Andrew Lunn says:

====================
MV88E6390 batch two

This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.

The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.

Similarly, the monitor control register has been reworked.

The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.

v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops

v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a6c5cb9

net: dsa: mv88e6xxx: Refactor CPU and DSA port setup · 56995cbc

由 Andrew Lunn 提交于 12月 03, 2016

Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and if to forward unknown frames.

This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56995cbc

net: dsa: mv88e6xxx: Move the tagging protocol into info · 443d5a1b

由 Andrew Lunn 提交于 12月 03, 2016

Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires we need to use
the DSA tagging. Hence separate the selection of the protocol from the
register layout.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

443d5a1b

net: dsa: mv88e6xxx: Monitor and Management tables · 33641994

由 Andrew Lunn 提交于 12月 03, 2016

The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.

Add ops to set the CPU port and the ingress/egress port for both
register layouts, to global1
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33641994

net: dsa: mv88e6xxx: Implement mv88e6390 tag remap · ef0a7318

由 Andrew Lunn 提交于 12月 03, 2016

The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection registers for setting a
number of different priority maps. Refactor the old code into an
function, implement the mv88e6390 version, and use an op to call the
right one.
Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef0a7318

Merge branch 'fib-notifier-event-replay' · 69248719

由 David S. Miller 提交于 12月 03, 2016

Jiri Pirko says:

====================
ipv4: fib: Replay events when registering FIB notifier

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.

The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case dump was inconsistent.

---
v3->v4:
- Register the notification block after the dump and protect it using
  the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
  the sysctl to set maximum number of retries and instead set it to a
  fixed number. Lets see if it's really a problem before adding something
  we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.

v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

69248719

ipv4: fib: Replay events when registering FIB notifier · c3852ef7