Commits · 70f23a807dcab4ee3467c70ba2c7be69026f0a67 · openanolis / cloud-kernel

06 Dec, 2016 25 commits

Authored by David S. Miller on Dec 05, 2016

Florian Fainelli says:

====================
net: ethoc: Misc improvements

This patch series fixes/improves a few things:

- implement a proper PHYLIB adjust_link callback to set the duplex mode
  accordingly
- do not open code the fetching of a MAC address in OF/DT environments
- demote an error message that occurs more frequently than expected in low
  CPU/memory/bandwidth environments

Tested on a Cirrus Logic EP93xx / TS7300 board.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

70f23a80

net: ethoc: Demote packet dropped error message to debug · 38b4bc20

Authored by Florian Fainelli on Dec 04, 2016

The console can be spammed fairly frequently with "net eth1: packet dropped"
if the adapter is busy transmitting, so demote the message to a debug print.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38b4bc20

net: ethoc: Utilize of_get_mac_address() · b34296a9

Authored by Florian Fainelli on Dec 04, 2016

Do not open code getting the MAC address exclusively from the
"local-mac-address" property, but instead use of_get_mac_address() which
looks up the MAC address using the 3 typical property names.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b34296a9
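
The fallback lookup described in this commit can be sketched in userspace C. This is only an illustration: `struct dt_prop`, `find_mac()` and `get_mac_address()` are hypothetical stand-ins, not the kernel's device-tree API, and the three property names and their order ("mac-address", "local-mac-address", "address") are stated here as an assumption about what of_get_mac_address() tried at the time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for a device-tree property (not the kernel type). */
struct dt_prop {
	const char *name;
	const unsigned char *value;
	size_t len;
};

/* A usable address is neither multicast (low bit of byte 0) nor all-zero. */
static bool mac_is_valid(const unsigned char *mac)
{
	static const unsigned char zero[ETH_ALEN];

	return !(mac[0] & 1) && memcmp(mac, zero, ETH_ALEN) != 0;
}

/* Return the value of property `name` if it holds a valid 6-byte address. */
static const unsigned char *find_mac(const struct dt_prop *props, size_t n,
				     const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(props[i].name, name) == 0 &&
		    props[i].len == ETH_ALEN && mac_is_valid(props[i].value))
			return props[i].value;
	}
	return NULL;
}

/* Try each property name in order; the first valid address wins. */
static const unsigned char *get_mac_address(const struct dt_prop *props,
					    size_t n)
{
	static const char *const names[] = {
		"mac-address", "local-mac-address", "address", /* assumed order */
	};

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		const unsigned char *mac = find_mac(props, n, names[i]);

		if (mac)
			return mac;
	}
	return NULL;
}
```

A driver that open-codes only the "local-mac-address" lookup would miss an address supplied under either of the other two names, which is the case the commit removes.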

net: ethoc: Account for duplex changes · abf7e53e

Authored by Florian Fainelli on Dec 04, 2016

ethoc_mdio_poll(), which is our PHYLIB adjust_link callback, does nothing;
we should at least react to duplex changes and change MODER accordingly.
Speed changes are not a problem, since the OpenCores Ethernet core seems
to react correctly without us telling it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abf7e53e
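
As a rough illustration of reacting to duplex changes, the following userspace model updates a mode register only when the reported duplex differs from the last one seen. `struct mac_model` and `adjust_link()` are hypothetical, and the bit position used for `MODER_FULLD` is an assumption made for the sketch, not a value taken from this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the full-duplex flag in the mode register. */
#define MODER_FULLD (UINT32_C(1) << 10)

/* Hypothetical model of the MAC state touched by an adjust_link callback. */
struct mac_model {
	uint32_t moder;      /* shadow of the mode register */
	int old_duplex;      /* -1 unknown, 0 half, 1 full */
	unsigned int writes; /* number of register updates performed */
};

/* Set or clear the full-duplex flag, leaving all other bits untouched. */
static uint32_t moder_for_duplex(uint32_t moder, bool full_duplex)
{
	return full_duplex ? (moder | MODER_FULLD) : (moder & ~MODER_FULLD);
}

/* Update the register only when the link is up and the duplex changed. */
static void adjust_link(struct mac_model *mac, bool link_up, int duplex)
{
	if (!link_up || duplex == mac->old_duplex)
		return;

	mac->moder = moder_for_duplex(mac->moder, duplex == 1);
	mac->writes++;
	mac->old_duplex = duplex;
}
```

Speed is deliberately absent from the model, matching the commit's observation that the core copes with speed changes on its own.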

net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd

Authored by Eric Dumazet on Dec 04, 2016

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c0d32fd
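
Why a sequence counter helps on 32-bit kernels can be sketched with C11 atomics in userspace. This is not the kernel's seqcount_t API: `struct rate_est`, `est_update()` and `est_read()` are hypothetical names, a single writer is assumed, and the 64-bit rate is split into two 32-bit halves to mimic what a 32-bit CPU must do.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical estimator state: one 64-bit rate stored as two halves. */
struct rate_est {
	atomic_uint seq;          /* odd while an update is in progress */
	_Atomic uint32_t bps_lo;
	_Atomic uint32_t bps_hi;
};

/* Writer side; assumes only one writer ever runs at a time. */
static void est_update(struct rate_est *e, uint64_t bps)
{
	unsigned int s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	/* Make the count odd so readers know an update is in flight. */
	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->bps_lo, (uint32_t)bps, memory_order_relaxed);
	atomic_store_explicit(&e->bps_hi, (uint32_t)(bps >> 32),
			      memory_order_relaxed);
	/* Even again: the new value is fully published. */
	atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

/* Reader side: retry until both halves come from the same update. */
static uint64_t est_read(struct rate_est *e)
{
	unsigned int s0, s1;
	uint32_t lo, hi;

	do {
		s0 = atomic_load_explicit(&e->seq, memory_order_acquire);
		lo = atomic_load_explicit(&e->bps_lo, memory_order_relaxed);
		hi = atomic_load_explicit(&e->bps_hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&e->seq, memory_order_relaxed);
	} while (s0 != s1 || (s0 & 1));

	return ((uint64_t)hi << 32) | lo;
}
```

Without the counter, a reader could combine the low half of one update with the high half of another, which is the torn-read problem the commit attributes to the old code on 32-bit kernels.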

net/sched: cls_flower: Set the filter Hardware device for all use-cases · a6e16931

Authored by Hadar Hen Zion on Dec 04, 2016

Check if the returned device from tcf_exts_get_dev function supports tc
offload and in case the rule can't be offloaded, set the filter hw_dev
parameter to the original device given by the user.

The filter hw_device parameter should always be set by fl_hw_replace_filter
function, since this pointer is used by dump stats and destroy
filter for each flower rule (offloaded or not).

Fixes: 7091d8c7 ('net/sched: cls_flower: Add offload support using egress Hardware device')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6e16931

ipv6: Allow IPv4-mapped address as next-hop · 96d5822c

Authored by Erik Nordmark on Dec 03, 2016

Make the kernel accept IPv6 routes with an IPv4-mapped address as next-hop.

It is possible to configure IP interfaces with IPv4-mapped addresses, and
one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
to this fix the kernel returned an EINVAL when attempting to add an IPv6
route with an IPv4-mapped address as a nexthop/gateway.

RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
thus in order to support that type of address configuration the kernel
needs to allow IPv4-mapped addresses as nexthops.
Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96d5822c
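
For readers unfamiliar with the address form involved, this small userspace sketch recognizes an IPv4-mapped IPv6 address (::ffff:a.b.c.d) using the standard `inet_pton()` call and `IN6_IS_ADDR_V4MAPPED()` macro. The helper name `gateway_is_v4mapped()` is hypothetical, and the sketch only classifies an address; it does not reproduce the kernel's route validation.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Return true if `text` parses as an IPv6 address of the IPv4-mapped form. */
static bool gateway_is_v4mapped(const char *text)
{
	struct in6_addr addr;

	/* inet_pton() returns 1 only on a successful parse. */
	if (inet_pton(AF_INET6, text, &addr) != 1)
		return false;

	return IN6_IS_ADDR_V4MAPPED(&addr) != 0;
}
```

A gateway such as ::ffff:192.0.2.1 is the kind of next-hop that, per the commit message, used to be rejected with EINVAL.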

bpf: Preserve const register type on const OR alu ops · 3c839744

Authored by Gianluca Borello on Dec 03, 2016

Occasionally, clang (e.g. version 3.8.1) translates a sum between two
constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
currently not handling this scenario, and the destination register type
becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
the destination register cannot be used as an argument to a helper function
expecting an ARG_CONST_STACK_*, limiting some use cases.

Modify the verifier to handle this case, and add a few tests to make sure
all combinations are supported, and stack boundaries are still verified
even with BPF_OR.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c839744
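
The type-tracking idea can be modelled in a few lines. This is a deliberately simplified sketch, not the verifier's code: `struct reg_state`, `enum alu_op` and `track_alu()` are hypothetical, and only two register types are modelled. The point it illustrates is that an OR of two known constants is still a known constant, just like an ADD.

```c
#include <stdint.h>

/* Simplified register types; the real verifier tracks many more. */
enum reg_type { UNKNOWN_VALUE, CONST_IMM };

struct reg_state {
	enum reg_type type;
	uint64_t imm; /* meaningful only when type == CONST_IMM */
};

enum alu_op { OP_ADD, OP_OR };

/* Update dst after "dst = dst <op> src" in this toy model. */
static void track_alu(struct reg_state *dst, const struct reg_state *src,
		      enum alu_op op)
{
	if (dst->type == CONST_IMM && src->type == CONST_IMM) {
		/* Both operands known: the result is known as well. */
		dst->imm = (op == OP_ADD) ? dst->imm + src->imm
					  : (dst->imm | src->imm);
		return;
	}

	/* Otherwise nothing can be said about the result. */
	dst->type = UNKNOWN_VALUE;
	dst->imm = 0;
}
```

With operands 8 and 16, whose bits do not overlap, ADD and OR both produce 24, which is presumably why a compiler may pick either instruction.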

r8169: Add support for restarting auto-negotiation · f0903ea3

Authored by Florian Fainelli on Dec 03, 2016

Implement ethtool::nway_restart by utilizing mii_nway_restart.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0903ea3

Merge branch 'for-upstream' of... · c3543688

Authored by David S. Miller on Dec 05, 2016

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2016-12-03

Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10
kernel):

 - Fix for a potential NULL deref in the ieee802154 netlink code
 - Fix for the ED values of the at86rf2xx driver
 - Documentation updates to ieee802154
 - Cleanups to u8 vs __u8 usage
 - Timer API usage cleanups in HCI drivers

Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c3543688

Merge branch 'tcp-tsq-perf' · 3f4888ad

Authored by David S. Miller on Dec 05, 2016

Eric Dumazet says:

====================
tcp: tsq: performance series

Under very high TX stress, the CPU handling NIC TX completions can spend
a considerable number of cycles handling TSQ (TCP Small Queues) logic.

This patch series avoids some atomic operations, but the most notable
patch is the 3rd one, which allows other cpus processing ACK packets and
calling tcp_write_xmit() to grab TCP_TSQ_DEFERRED so that
tcp_tasklet_func() can skip already processed sockets.

This avoids lots of lock acquisitions and cache line accesses,
particularly under load.

In v2, I added:

- tcp_small_queue_check() change to allow 1st and 2nd packets
  in write queue to be sent, even in the case TX completion of
  already acknowledged packets did not happen yet.
  This helps when TX completion coalescing parameters are set
  even to insane values, and/or busy polling is used.

- A reorganization of struct sock fields to
  lower false sharing and increase data locality.

- Then I moved tsq_flags from tcp_sock to struct sock also
  to reduce cache line misses during TX completions.

I measured an overall throughput gain of 22 % for heavy TCP use
over a single TX queue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3f4888ad

tcp: tsq: move tsq_flags close to sk_wmem_alloc · 7aa5470c

Authored by Eric Dumazet on Dec 03, 2016

tsq_flags being in the same cache line as sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7aa5470c

net: reorganize struct sock for better data locality · 9115e8cd

Authored by Eric Dumazet on Dec 03, 2016

Group fields used in TX path, and keep some cache lines mostly read
to permit sharing among cpus.

Gained two 4 bytes holes on 64bit arches.

Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
to speed up tcp_wfree() in the following patch.

I have not added ____cacheline_aligned_in_smp, this might be done later.
I prefer doing this once inet and tcp/udp sockets reorg is also done.

Tested with both TCP and UDP.

UDP receiver performance under flood increased by ~20 % :
Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
was moved away from a critical cache line, now mostly read and shared.

	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned int               sk_napi_id;           /* 0x100   0x4 */
	int                        sk_rcvbuf;            /* 0x104   0x4 */
	struct sk_filter *         sk_filter;            /* 0x108   0x8 */
	union {
		struct socket_wq * sk_wq;                /*         0x8 */
		struct socket_wq * sk_wq_raw;            /*         0x8 */
	};                                               /* 0x110   0x8 */
	struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
	struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
	struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
	atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
	int                        sk_sndbuf;            /* 0x13c   0x4 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	int                        sk_wmem_queued;       /* 0x140   0x4 */
	atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
	long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
	struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
	struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
	__s32                      sk_peek_off;          /* 0x170   0x4 */
	int                        sk_write_pending;     /* 0x174   0x4 */
	long int                   sk_sndtimeo;          /* 0x178   0x8 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9115e8cd

tcp: tcp_mtu_probe() is likely to exit early · 12a59abc

Authored by Eric Dumazet on Dec 03, 2016

Adding a likely() in tcp_mtu_probe() moves its code, which used to
be inlined in front of tcp_write_xmit().

We still have a cache line miss to access icsk->icsk_mtup.enabled;
we will probably have to reorganize fields to help data locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12a59abc

tcp: tsq: add a shortcut in tcp_small_queue_check() · 75eefc6c

Authored by Eric Dumazet on Dec 03, 2016

Always allow the first two skbs in the write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

75eefc6c

tcp: tsq: avoid one atomic in tcp_wfree() · a9b204d1

Authored by Eric Dumazet on Dec 03, 2016

Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We can schedule it only if our per cpu list was empty.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a9b204d1

tcp: tsq: add shortcut in tcp_tasklet_func() · b223feb9

Authored by Eric Dumazet on Dec 03, 2016

Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
for other cpus/threads entering tcp_write_xmit() to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give ACK processing an increased
budget to further reduce the amount of work done by tcp_tasklet_func().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b223feb9

tcp: tsq: remove one locked operation in tcp_wfree() · 408f0a6c

Authored by Eric Dumazet on Dec 03, 2016

Instead of atomically clearing TSQ_THROTTLED and atomically setting
TSQ_QUEUED, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

408f0a6c
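
The single-locked-operation idea can be sketched with a C11 compare-exchange loop. This is a userspace illustration under assumptions: the `FLAG_*` names and bit positions are hypothetical, `wfree_transition()` is not a kernel function, and C11 `atomic_compare_exchange_weak()` stands in for the kernel's cmpxchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; positions are chosen for the sketch only. */
#define FLAG_THROTTLED (1ul << 0)
#define FLAG_QUEUED    (1ul << 1)
#define FLAG_DEFERRED  (1ul << 2)

/*
 * Clear THROTTLED and set QUEUED (plus DEFERRED) in one atomic step.
 * Returns true if this caller performed the transition, false if the
 * socket was not throttled or was already queued.
 */
static bool wfree_transition(atomic_ulong *flags)
{
	unsigned long oval = atomic_load(flags);

	for (;;) {
		unsigned long nval;

		if (!(oval & FLAG_THROTTLED) || (oval & FLAG_QUEUED))
			return false;

		nval = (oval & ~FLAG_THROTTLED) | FLAG_QUEUED | FLAG_DEFERRED;

		/* On failure, oval is refreshed with the current value. */
		if (atomic_compare_exchange_weak(flags, &oval, nval))
			return true;
	}
}
```

Because the new value is computed before the exchange, setting one more bit such as the deferred flag costs no extra atomic operation, which matches the commit's remark that the addition becomes free.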

tcp: tsq: add tsq_flags / tsq_enum · 40fc3423

Authored by Eric Dumazet on Dec 03, 2016

This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

40fc3423

Merge branch 'bnxt_en-dcbnl' · f83e8303

Authored by David S. Miller on Dec 05, 2016

Michael Chan says:

====================
bnxt_en: Add DCBNL support.

This series adds DCBNL operations to support host-based IEEE DCBX.

v2: Updated to the latest firmware interface spec.

David, please consider this series for net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f83e8303

bnxt_en: Add PFC statistics. · c77192f2

Authored by Michael Chan on Dec 02, 2016

Report PFC statistics to ethtool -S and DCBNL.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c77192f2

bnxt_en: Implement DCBNL to support host-based DCBX. · 7df4ae9f

Authored by Michael Chan on Dec 02, 2016

Support only IEEE DCBX initially.  Add IEEE DCBNL ops and functions to
get and set the hardware DCBX parameters.  The DCB code is conditional on
Kconfig CONFIG_BNXT_DCB.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7df4ae9f

bnxt_en: Update firmware header file to latest 1.6.0. · 87c374de

Authored by Michael Chan on Dec 02, 2016

Latest interface has the latest DCB command structs.  Get and store the
max number of lossless TCs the hardware can support.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87c374de

bnxt_en: Re-factor bnxt_setup_tc(). · c5e3deb8

Authored by Michael Chan on Dec 02, 2016

Add a new function bnxt_setup_mq_tc() to handle MQPRIO. This new function
will be called during ETS setup when we add DCBNL in the next patch.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5e3deb8

net: phy: dp83848: Support ethernet pause frames · c7a61319

Authored by Jesper Nilsson on Dec 02, 2016

According to the documentation, the PHYs supported by this driver
can also support pause frames. Advertise this capability.
Tested with a TI83822I.
Acked-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7a61319

04 Dec, 2016 15 commits

ipv6 addrconf: Implemented enhanced DAD (RFC7527) · adc176c5

Authored by Erik Nordmark on Dec 02, 2016

Implement RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. It can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.
Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

adc176c5
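
The nonce comparison at the heart of Enhanced DAD can be sketched as follows. This is an illustration only: `ns_is_own_probe()` is a hypothetical helper, the 6-byte nonce length is an assumption for the sketch, and option parsing is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DAD_NONCE_LEN 6 /* assumed nonce length for this sketch */

/*
 * Decide whether a received Neighbor Solicitation is our own DAD probe
 * looped back to us. `rx_nonce` is NULL when the NS carried no nonce.
 */
static bool ns_is_own_probe(const uint8_t *sent_nonce, const uint8_t *rx_nonce)
{
	if (rx_nonce == NULL)
		return false; /* no nonce: treat as a genuine duplicate */

	return memcmp(sent_nonce, rx_nonce, DAD_NONCE_LEN) == 0;
}
```

A probe for which this returns true would be ignored rather than counted as a duplicate address, which is the behaviour the commit message describes.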

Merge branch 'mv88e6390-batch-three' · ce84c7c6

Authored by David S. Miller on Dec 03, 2016

Andrew Lunn says:

====================
mv88e6390 batch 3

More patches to support the MV88e6390. This is mostly refactoring
existing code and adding implementations for the mv88e6390.  This
patchset sets which reserved frames are sent to the CPU and the size of
jumbo frames that will be accepted, turns off egress rate limiting, and
configures pause frames.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ce84c7c6

net: dsa: mv88e6xxx: Implement mv88e6390 pause control · 3ce0e65e

Authored by Andrew Lunn on Dec 03, 2016

The mv88e6390 has a number of flow control registers accessed via the
Flow Control register. Use these to set the pause control.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ce0e65e

net: dsa: mv88e6xxx: Refactor pause configuration · b35d322a

Authored by Andrew Lunn on Dec 03, 2016

The mv88e6390 has a different mechanism for configuring pause.
Refactor the code into an ops function, and for the moment, don't add
any mv88e6390 code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

b35d322a

net: dsa: mv88e6xxx: Refactor egress rate limiting · ef70b111

Authored by Andrew Lunn on Dec 03, 2016

There are two different rate limiting configurations, depending on the
switch generation. Refactor this into ops.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef70b111
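
The "refactor into ops" pattern used throughout this series can be sketched as a table of function pointers selected per chip generation. Everything here is hypothetical: `struct chip`, `struct chip_ops`, the `gen_a`/`gen_b` names and the register values are placeholders, not the driver's real definitions.

```c
#include <stddef.h>

#define NUM_PORTS 8

struct chip;

/* Per-generation operations; a NULL entry means "not supported". */
struct chip_ops {
	int (*port_egress_rate_limiting)(struct chip *chip, int port);
};

struct chip {
	const struct chip_ops *ops;
	unsigned int rate_ctrl[NUM_PORTS]; /* stand-in for a port register */
};

/* Placeholder values: the log does not give the real register contents. */
static int gen_a_rate_limiting_off(struct chip *chip, int port)
{
	chip->rate_ctrl[port] = 0x0001;
	return 0;
}

static int gen_b_rate_limiting_off(struct chip *chip, int port)
{
	chip->rate_ctrl[port] = 0x0000;
	return 0;
}

static const struct chip_ops gen_a_ops = {
	.port_egress_rate_limiting = gen_a_rate_limiting_off,
};

static const struct chip_ops gen_b_ops = {
	.port_egress_rate_limiting = gen_b_rate_limiting_off,
};

static const struct chip_ops no_limit_ops = {
	.port_egress_rate_limiting = NULL,
};

/* Common setup code calls through the table instead of testing families. */
static int setup_port(struct chip *chip, int port)
{
	if (port < 0 || port >= NUM_PORTS)
		return -1;

	if (chip->ops->port_egress_rate_limiting)
		return chip->ops->port_egress_rate_limiting(chip, port);

	return 0; /* nothing to configure on this generation */
}
```

The design choice is that adding a new generation means adding one more ops table, while the common setup path stays unchanged.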

net: dsa: mv88e6xxx: Refactor setting of jumbo frames · 5f436666

Authored by Andrew Lunn on Dec 03, 2016

Some switches support jumbo frames. Refactor this code into operations
in the ops structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f436666

net: dsa: mv88e6xxx: Reserved Management frames to CPU · 6e55f698

Authored by Andrew Lunn on Dec 03, 2016

Older devices have a couple of registers in global2. The mv88e6390
family has a single register in global1 behind which hides similar
configuration. Implement an op for this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e55f698

Merge branch 'mv88e6390-batch-two' · 7a6c5cb9

Authored by David S. Miller on Dec 03, 2016

Andrew Lunn says:

====================
MV88E6390 batch two

This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.

The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.

Similarly, the monitor control register has been reworked.

The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.

v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops

v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7a6c5cb9

net: dsa: mv88e6xxx: Refactor CPU and DSA port setup · 56995cbc

Authored by Andrew Lunn on Dec 03, 2016

Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and whether to forward unknown frames.

This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
Signed-off-by: David S. Miller <davem@davemloft.net>

56995cbc

net: dsa: mv88e6xxx: Move the tagging protocol into info · 443d5a1b

Authored by Andrew Lunn on Dec 03, 2016

Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires us to use
DSA tagging. Hence separate the selection of the protocol from the
register layout.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

443d5a1b

net: dsa: mv88e6xxx: Monitor and Management tables · 33641994

Authored by Andrew Lunn on Dec 03, 2016

The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.

Add ops to set the CPU port and the ingress/egress port for both
register layouts, to global1.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

33641994

net: dsa: mv88e6xxx: Implement mv88e6390 tag remap · ef0a7318

Authored by Andrew Lunn on Dec 03, 2016

The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection register for setting a
number of different priority maps. Refactor the old code into a
function, implement the mv88e6390 version, and use an op to call the
right one.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef0a7318

Merge branch 'fib-notifier-event-replay' · 69248719

Authored by David S. Miller on Dec 03, 2016

Jiri Pirko says:

====================
ipv4: fib: Replay events when registering FIB notifier

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.

The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case dump was inconsistent.

---
v3->v4:
- Register the notification block after the dump and protect it using
  the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
  the sysctl to set maximum number of retries and instead set it to a
  fixed number. Let's see if it's really a problem before adding something
  we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.

v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

69248719

ipv4: fib: Replay events when registering FIB notifier · c3852ef7

Authored by Ido Schimmel on Dec 03, 2016

Commit b90eb754 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (e.g., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends an ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.

Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.

The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case the current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3852ef7
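
The dump-and-verify loop described above can be modelled in userspace. This is a sketch under stated assumptions: `struct fib_model`, `struct listener`, `register_with_replay()` and the example callbacks are hypothetical, the retry limit of 5 is an assumed value, and the RCU/RTNL locking of the real code is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define DUMP_MAX_RETRIES 5 /* assumed fixed limit, for illustration */

struct fib_model {
	unsigned int change_seq; /* bumped before each add/delete event */
	int pending_races;       /* test knob: changes landing mid-dump */
};

struct listener {
	int replayed; /* entries replayed into the listener */
	int flushes;  /* times the inconsistency callback ran */
};

/* Example dump: replays one entry; a simulated concurrent change may land. */
static void example_dump(struct fib_model *fib, struct listener *l)
{
	l->replayed++;
	if (fib->pending_races > 0) {
		fib->pending_races--;
		fib->change_seq++;
	}
}

/* Example callback: discard state built from an inconsistent dump. */
static void example_flush(struct listener *l)
{
	l->replayed = 0;
	l->flushes++;
}

/* Dump, then accept the result only if the counter did not move. */
static bool register_with_replay(struct fib_model *fib, struct listener *l,
				 void (*dump)(struct fib_model *,
					      struct listener *),
				 void (*cb)(struct listener *))
{
	for (int attempt = 0; attempt < DUMP_MAX_RETRIES; attempt++) {
		unsigned int before = fib->change_seq;

		dump(fib, l);
		if (fib->change_seq == before)
			return true; /* nothing changed mid-dump */
		if (cb)
			cb(l); /* let the caller undo the partial view */
	}
	return false; /* gave up after the fixed number of retries */
}
```

Bounding the loop trades completeness for predictability: a table that never stops changing makes registration fail instead of spinning indefinitely.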

ipv4: fib: Allow for consistent FIB dumping · cacaad11

Authored by Ido Schimmel on Dec 03, 2016

The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cacaad11
