Commits · e227701c4583f0408cac33eca0fa96ac4b8ff7d9 · openeuler / Kernel

04 Jul, 2019 · 11 commits

Merge branch 'net-ICW-sendmsg-recvmsg' · e227701c

Committed by David S. Miller on Jul 03, 2019

Paolo Abeni says:

====================
net: use ICW for sk_proto->{send,recv}msg

This series extends ICW usage to one of the few remaining spots in the fast path
still hitting per-packet retpoline overhead, namely the sk_proto->{send,recv}msg
calls.

The first 3 patches in this series refactor the existing code so that applying
the ICW macros is straightforward: we demux inet_{recv,send}msg into ipv4 and
ipv6 variants so that each of them can easily select the appropriate TCP or UDP
direct call. While at it, a new helper is created to avoid excessive code
duplication, and the current ICWs for inet_{recv,send}msg are adjusted
accordingly.

The last 2 patches really introduce the new ICW use-case, respectively for the
ipv6 and the ipv4 code path.

This gives up to 5% performance improvement under UDP flood, and smaller but
measurable gains for TCP RR workloads.

v1 -> v2:
 - drop inet6_{recv,send}msg declaration from header file,
   prefer ICW macro instead
 - avoid unneeded redeclaration for udp_sendmsg, as suggested by Willem
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: use indirect call wrappers for {tcp, udp}_{recv, send}msg() · 6f24080e

Committed by Paolo Abeni on Jul 03, 2019

This avoids an indirect call per syscall for common ipv4 transports

v1 -> v2:
 - avoid unneeded redeclaration for udp_sendmsg, as suggested by Willem
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: use indirect call wrappers for {tcp, udpv6}_{recv, send}msg() · 164c51fe

Committed by Paolo Abeni on Jul 03, 2019

This avoids an indirect call per syscall for common ipv6 transports
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: adjust socket level ICW to cope with ipv6 variant of {recv, send}msg · a648a592

Committed by Paolo Abeni on Jul 03, 2019

After the previous patch we have ipv{6,4} variants for {recv,send}msg,
so we should use the generic _INET ICW variant to call into the proper
built-in.

This also allows dropping the now unused and rather ugly _INET4 ICW macro.

v1 -> v2:
 - use ICW macro to declare inet6_{recv,send}msg
 - fix a couple of checkpatch offenders in the code context
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: provide and use ipv6 specific version for {recv, send}msg · 68ab5d14

Committed by Paolo Abeni on Jul 03, 2019

This will simplify indirect call wrapper invocation in the following
patch.

No functional change intended; any out-of-tree IPv6 user of
inet_{recv,send}msg can keep using the existing functions.

SCTP code still uses the existing version even for ipv6: as this series
will not add ICW for SCTP, moving to the new helper would not give
any benefit.

The only other in-kernel user of inet_{recv,send}msg is
pvcalls_conn_back_read(), but pvcalls explicitly creates only IPv4 sockets,
so there is no need to update that code path, too.

v1 -> v2: drop inet6_{recv,send}msg declaration from header file,
   prefer ICW macro instead
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: factor out inet_send_prepare() · e4730936

Committed by Paolo Abeni on Jul 03, 2019

The same code is replicated verbatim in multiple places, and the next
patches will introduce an additional user for it. Factor out a
helper and use it where appropriate. No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: remove redundant assignment to variable err · 2559d7c4

Committed by Colin Ian King on Jul 03, 2019

The variable err is being initialized with a value that is never
read and it is being updated later with a new value. The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atl1c: remove redundant assignment to variable tpd_req · b70d846c

Committed by Colin Ian King on Jul 03, 2019

The variable tpd_req is being initialized with a value that is never
read and it is being updated later with a new value. The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Add support for Timestamping the unicast PTP packets. · cedeac9d

Committed by Sudarsana Reddy Kalluru on Jul 02, 2019

This patch adds driver changes to detect/timestamp the unicast PTP packets.

Changes from previous version:
-------------------------------
v2: Defined a macro for unicast ptp param mask.

Please consider applying this to "net-next".
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Fix u64_stats_sync to initialize start · 3c13ce74

Committed by Catherine Sullivan on Jul 02, 2019

u64_stats_fetch_begin needs to initialize start.
Signed-off-by: Catherine Sullivan <csully@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

loopback: fix lockdep splat · d62962b3

Committed by Mahesh Bandewar on Jul 02, 2019

dev_init_scheduler() and dev_activate() expect the caller to
hold RTNL. Since we don't want the blackhole device to be initialized
per netns, it is initialized once at init time.

[    3.855027] Call Trace:
[    3.855034]  dump_stack+0x67/0x95
[    3.855037]  lockdep_rcu_suspicious+0xd5/0x110
[    3.855044]  dev_init_scheduler+0xe3/0x120
[    3.855048]  ? net_olddevs_init+0x60/0x60
[    3.855050]  blackhole_netdev_init+0x45/0x6e
[    3.855052]  do_one_initcall+0x6c/0x2fa
[    3.855058]  ? rcu_read_lock_sched_held+0x8c/0xa0
[    3.855066]  kernel_init_freeable+0x1e5/0x288
[    3.855071]  ? rest_init+0x260/0x260
[    3.855074]  kernel_init+0xf/0x180
[    3.855076]  ? rest_init+0x260/0x260
[    3.855078]  ret_from_fork+0x24/0x30

Fixes: 4de83b88 ("loopback: create blackhole net device similar to loopack.")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Jul, 2019 · 12 commits

mlxsw: spectrum_ptp: Fix validation in mlxsw_sp1_ptp_packet_finish() · dbcdb61a

Committed by Petr Machata on Jul 02, 2019

Before mlxsw_sp1_ptp_packet_finish() sends the packet back, it validates
whether the corresponding port is still valid. However, the condition is
incorrect: when mlxsw_sp_port == NULL, the code dereferences the port to
compare it to skb->dev.

The condition needs to check whether the port is present and skb->dev still
refers to that port (or else is NULL). If that does not hold, bail out.
Add a pair of parentheses to fix the condition.

Fixes: d92e4e6e ("mlxsw: spectrum: PTP: Support timestamping on Spectrum-1")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add random MAC address fallback · c782e204

Committed by Heiner Kallweit on Jul 02, 2019

It was reported that the GPD MicroPC is broken in a way that no valid
MAC address can be read from the network chip. The vendor driver deals
with this by assigning a random MAC address as fallback. So let's do
the same.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "r8169: improve handling VLAN tag" · 7424edbb

Committed by Heiner Kallweit on Jul 02, 2019

This reverts commit 759d0957.

The patch was based on a misunderstanding. As Al Viro pointed out [0]
it's simply wrong on big endian. So let's revert it.

[0] https://marc.info/?t=156200975600004&r=1&w=2
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: make "snps,reset-delays-us" optional again · cc5e92c2

Committed by Martin Blumenstingl on Jul 02, 2019

Commit 760f1dc2 ("net: stmmac: add sanity check to
device_property_read_u32_array call") introduced error checking of the
device_property_read_u32_array() call in stmmac_mdio_reset().
This results in the following error when the "snps,reset-delays-us"
property is not defined in devicetree:
  invalid property snps,reset-delays-us

This sanity check made sense until commit 84ce4d0f ("net: stmmac:
initialize the reset delay array") ensured that there are fallback
values for the reset delay if the "snps,reset-delays-us" property is
absent. That was at the cost of making that property mandatory though.

Drop the sanity check for device_property_read_u32_array() and thus make
the "snps,reset-delays-us" property optional again (avoiding the error
message while loading the stmmac driver with a .dtb where the property
is absent).

Fixes: 760f1dc2 ("net: stmmac: add sanity check to device_property_read_u32_array call")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding/main: fix NULL dereference in bond_select_active_slave() · b8bd72d3

Committed by Eric Dumazet on Jul 01, 2019

A bonding master can be up while best_slave is NULL.

[12105.636318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[12105.638204] mlx4_en: eth1: Linkstate event 1 -> 1
[12105.648984] IP: bond_select_active_slave+0x125/0x250
[12105.653977] PGD 0 P4D 0
[12105.656572] Oops: 0000 [#1] SMP PTI
[12105.660487] gsmi: Log Shutdown Reason 0x03
[12105.664620] Modules linked in: kvm_intel loop act_mirred uhaul vfat fat stg_standard_ftl stg_megablocks stg_idt stg_hdi stg elephant_dev_num stg_idt_eeprom w1_therm wire i2c_mux_pca954x i2c_mux mlx4_i2c i2c_usb cdc_acm ehci_pci ehci_hcd i2c_iimc mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [last unloaded: kvm_intel]
[12105.685686] mlx4_core 0000:03:00.0: dispatching link up event for port 2
[12105.685700] mlx4_en: eth2: Linkstate event 2 -> 1
[12105.685700] mlx4_en: eth2: Link Up (linkstate)
[12105.724452] Workqueue: bond0 bond_mii_monitor
[12105.728854] RIP: 0010:bond_select_active_slave+0x125/0x250
[12105.734355] RSP: 0018:ffffaf146a81fd88 EFLAGS: 00010246
[12105.739637] RAX: 0000000000000003 RBX: ffff8c62b03c6900 RCX: 0000000000000000
[12105.746838] RDX: 0000000000000000 RSI: ffffaf146a81fd08 RDI: ffff8c62b03c6000
[12105.754054] RBP: ffffaf146a81fdb8 R08: 0000000000000001 R09: ffff8c517d387600
[12105.761299] R10: 00000000001075d9 R11: ffffffffaceba92f R12: 0000000000000000
[12105.768553] R13: ffff8c8240ae4800 R14: 0000000000000000 R15: 0000000000000000
[12105.775748] FS:  0000000000000000(0000) GS:ffff8c62bfa40000(0000) knlGS:0000000000000000
[12105.783892] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12105.789716] CR2: 0000000000000000 CR3: 0000000d0520e001 CR4: 00000000001626f0
[12105.796976] Call Trace:
[12105.799446]  [<ffffffffac31d387>] bond_mii_monitor+0x497/0x6f0
[12105.805317]  [<ffffffffabd42643>] process_one_work+0x143/0x370
[12105.811225]  [<ffffffffabd42c7a>] worker_thread+0x4a/0x360
[12105.816761]  [<ffffffffabd48bc5>] kthread+0x105/0x140
[12105.821865]  [<ffffffffabd42c30>] ? rescuer_thread+0x380/0x380
[12105.827757]  [<ffffffffabd48ac0>] ? kthread_associate_blkcg+0xc0/0xc0
[12105.834266]  [<ffffffffac600241>] ret_from_fork+0x51/0x60

Fixes: e2a7420d ("bonding/main: convert to using slave printk macros")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: remove ub->ubsock checks · d2c3a4ba

Committed by Xin Long on Jul 02, 2019

Both tipc_udp_enable and tipc_udp_disable are called under rtnl_lock,
so ub->ubsock can never be NULL in tipc_udp_disable and cleanup_bearer;
remove the check.

Also remove the one in tipc_udp_enable by adding a "free" label.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Fix off-by-one in route dump counter without netlink strict checking · 885b8b4d

Committed by Stefano Brivio on Jun 29, 2019

In commit ee28906f ("ipv4: Dump route exceptions if requested") I
added a counter of per-node dumped routes (including actual routes and
exceptions), analogous to the existing counter for dumped nodes. Dumping
exceptions means we need to also keep track of how many routes are dumped
for each node: this would be just one route per node, without exceptions.

When netlink strict checking is not enabled, we dump both routes and
exceptions at the same time: the RTM_F_CLONED flag is not used as a
filter. In this case, the per-node counter 'i_fa' is incremented by one
to track the single dumped route, then also incremented by one for each
exception dumped, and then stored as netlink callback argument as skip
counter, 's_fa', to be used when a partial dump operation restarts.

The per-node counter needs to be increased by one also when we skip a
route (exception) due to a previous non-zero skip counter, because it
needs to match the existing skip counter, if we are dumping both routes
and exceptions. I missed this, and only incremented the counter, for
regular routes, if the previous skip counter was zero. This means that,
in case of a mixed dump, partial dump operations after the first one
will start with a mismatching skip counter value, one less than expected.

This means in turn that the first exception for a given node is skipped
every time a partial dump operation restarts, if netlink strict checking
is not enabled (iproute2 < 5.0).

It turns out I didn't repeat the test in its final version, commit
de755a85 ("selftests: pmtu: Introduce list_flush_ipv4_exception test
case"), which also counts the number of route exceptions returned, with
iproute2 versions < 5.0 -- I was instead using the equivalent of the IPv6
test as it was before commit b964641e ("selftests: pmtu: Make
list_flush_ipv6_exception test more demanding").

Always increment the per-node counter by one if we previously dumped
a regular route, so that it matches the current skip counter.

Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: mediatek: Allow non TRGMII mode with MT7621 DDR2 devices · cce581a0

Committed by René van Dorst on Jun 29, 2019

There is no reason to error out on an MT7621 device with DDR2 memory when a
non-TRGMII mode is selected. Only the MT7621 DDR2 clock setup for TRGMII mode
is unsupported, and non-TRGMII modes don't need any special clock setup.
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rxrpc: Fix uninitialized error code in rxrpc_send_data_packet() · 3427beb6

Committed by David Howells on Jul 02, 2019

With gcc 4.1:

    net/rxrpc/output.c: In function ‘rxrpc_send_data_packet’:
    net/rxrpc/output.c:338: warning: ‘ret’ may be used uninitialized in this function

Indeed, if the first jump to the send_fragmentable label is made, and
the address family is not handled in the switch() statement, ret will be
used uninitialized.

Fix this by BUG()'ing as is done in other places in rxrpc where internal
support for future address families will need adding.  It should not be
possible to reach this normally as the address families are checked
up-front.

Fixes: 5a924b89 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfc: st-nci: remove redundant assignment to variable r · 23ec8eaf

Committed by Colin Ian King on Jul 02, 2019

The variable r is being initialized with a value that is never
read and it is being updated later with a new value. The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hinic: remove standard netdev stats · 83b6a85b

Committed by Xue Chaojing on Jul 01, 2019

This patch removes the standard netdev stats from the ethtool -S output.
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Re-word Kconfig entry · b432bdb6

Committed by Jose Abreu on Jul 02, 2019

We support many speeds and it doesn't make much sense to list them all
in the Kconfig. Let's just call it Multi-Gigabit.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Jul, 2019 · 17 commits

Merge branch 'Add-gve-driver' · 337d1ccb

Committed by David S. Miller on Jul 01, 2019

Catherine Sullivan says:

====================
Add gve driver

This patch series adds the gve driver which will support the
Compute Engine Virtual NIC that will be available in the future.

v2:
- Patch 1:
  - Remove gve_size_assert.h and use static_assert instead.
  - Loop forever instead of bugging if the device won't reset
  - Use module_pci_driver
- Patch 2:
  - Use be16_to_cpu in the RX Seq No define
  - Remove unneeded ndo_change_mtu
- Patch 3:
  - No Changes
- Patch 4:
  - Instead of checking netif_carrier_ok in ethtool stats, just make sure

v3:
- Patch 1:
  - Remove X86 dep
- Patch 2:
  - No changes
- Patch 3:
  - No changes
- Patch 4:
  - Remove unneeded memsets in ethtool stats

v4:
- Patch 1:
  - Use io[read|write]32be instead of [read|write]l(cpu_to_be32())
  - Explicitly add padding to gve_adminq_set_driver_parameter
  - Use static where appropriate
- Patch 2:
  - Use u64_stats_sync
  - Explicitly add padding to gve_adminq_create_rx_queue
  - Fix some endianness typing issues found by kbuild
  - Use static where appropriate
  - Remove unused variables
- Patch 3:
  - Use io[read|write]32be instead of [read|write]l(cpu_to_be32())
- Patch 4:
  - Use u64_stats_sync
  - Use static where appropriate
Warnings reported by:
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Add ethtool support · e5b845dc

Committed by Catherine Sullivan on Jul 01, 2019

Add support for the following ethtool commands:

ethtool -s|--change devname [msglvl N] [msglevel type on|off]
ethtool -S|--statistics devname
ethtool -i|--driver devname
ethtool -l|--show-channels devname
ethtool -L|--set-channels devname
ethtool -g|--show-ring devname
ethtool --reset devname
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Jon Olson <jonolson@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Add workqueue and reset support · 9e5f7d26

Committed by Catherine Sullivan on Jul 01, 2019

Add support for the workqueue to handle management interrupts and
support for resets.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Jon Olson <jonolson@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Add transmit and receive support · f5cedc84

Committed by Catherine Sullivan on Jul 01, 2019

Add support for passing traffic.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Jon Olson <jonolson@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Add basic driver framework for Compute Engine Virtual NIC · 893ce44d

Committed by Catherine Sullivan on Jul 01, 2019

Add a driver framework for the Compute Engine Virtual NIC that will be
available in the future.

At this point the only functionality is loading the driver.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Jon Olson <jonolson@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'blackhole-device-to-invalidate-dst' · 2a8d8e0f

Committed by David S. Miller on Jul 01, 2019

Mahesh Bandewar says:

====================
blackhole device to invalidate dst

When we invalidate a dst or mark it "dead", we assign 'lo' to
dst->dev. First of all, this assignment is racy; moreover,
it has MTU implications.

The standard dev MTU is 1500 while the loopback MTU is 64k. TCP
code does not check whether the dst is valid when dereferencing
it. When TCP dereferences a dead dst while negotiating a new
connection, it may use the dst device, which is 'lo', instead of
the correct device. Consider the following scenario:

A SYN arrives on an interface and the TCP layer, while processing
the SYNACK, finds a dst and associates it with the SYNACK skb. If,
before the skb gets passed to L3 for processing, that dst becomes
"dead" (because the virtual device disappeared and then
reappeared), 'lo' gets assigned to that dst (lo MTU = 64k). Let's
assume the SYN has ADV_MSS set to 9k while the output device
through which this SYNACK is going to go out has the standard MTU
of 1500. The MTU check during the route check passes since
MIN(9K, 64K) is 9k, and TCP successfully negotiates a 9k MSS. The
subsequent, larger data packet gets passed to the device and it
won't be marked as GSO since the assumed MTU of the device is
9k.

This can crash the NIC, and we have seen fixes go into drivers to
handle this scenario: 8914a595 ('bnx2x: disable GSO where gso_size
is too big for hardware') and 2b16f048 ('net: create
skb_gso_validate_mac_len()'). With those fixes TCP eventually
recovers, but not before a few segments are dropped.

Well, I'm not a TCP expert and though we have experienced
these corner cases in our environment, I could not reproduce
this case reliably in my test setup to try this fix myself.
However, Michael Chan <michael.chan@broadcom.com> had a setup
where these fixes helped him mitigate the issue and avoid the
crash.

The idea here is not to alter the data path with additional
locks or smb()/rmb() barriers to avoid racy assignments, but
to create a new device with a really low MTU whose
.ndo_start_xmit is essentially kfree_skb(). Use this device
instead of 'lo' when marking the dst dead.

The first patch implements the blackhole device, the second
patch uses it in the IPv4 and IPv6 stacks, and the third patch
is the selftest that ensures the sanity of this device.

v1->v2
  fixed the self-test patch to handle the conflict

v2 -> v3
  fixed Kconfig text/string.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

blackhole_dev: add a selftest · 509e56b3

Committed by Mahesh Bandewar on Jul 01, 2019

Since this is not really a device with all capabilities, this test
ensures that it has *enough* to make it through the data path
without causing unwanted side-effects (read crash!).
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

blackhole_netdev: use blackhole_netdev to invalidate dst entries · 8d7017fd

Committed by Mahesh Bandewar on Jul 01, 2019

Use blackhole_netdev instead of 'lo' device with lower MTU when marking
dst "dead".
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Tested-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

loopback: create blackhole net device similar to loopack. · 4de83b88

Committed by Mahesh Bandewar on Jul 01, 2019

Create a blackhole net device that can be used for "dead"
dst entries instead of the loopback device. This blackhole device
differs from loopback in a few aspects: (a) it's not per-ns, (b) the
MTU on this device is ETH_MIN_MTU, (c) the xmit function is
essentially kfree_skb(), and (d) since it's not registered, it won't
have an ifindex.

The lower MTU effectively makes the device fail the MTU check during
the route check when the dst associated with the skb is dead.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: broadcom: bcm63xx_enet: Remove unneeded memset · 8909783c

Committed by Hariprasad Kelam on Jun 30, 2019

Remove the unneeded memset, as alloc_etherdev uses kvzalloc, which
passes the __GFP_ZERO flag.
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-netsec-Add-XDP-Support' · fec3b9ec

Committed by David S. Miller on Jul 01, 2019

Ilias Apalodimas says:

====================
net: netsec: Add XDP Support

This is a respin of https://www.spinics.net/lists/netdev/msg526066.html
Since the page_pool API fixes are merged into net-next, we can now safely use
its DMA mapping capabilities.

First patch changes the buffer allocation from napi/netdev_alloc_frag()
to page_pool API. Although this will lead to slightly reduced performance
(on raw packet drops only) we can use the API for XDP buffer recycling.
Another side effect is a slight increase in memory usage, due to using a
single page per packet.

The second patch adds XDP support on the driver.
There's a bunch of interesting options that come up due to the single
Tx queue.
Locking is needed (to avoid messing up the Tx queues, since ndo_xdp_xmit
and the normal stack can co-exist). We also need to track the
'buffer type' for TX and properly free or recycle the packet depending
on its nature.

Changes since RFC:
- Bug fixes from Jesper and Maciej
- Added page pool API to retrieve the DMA direction

Changes since v1:
- Use page_pool_free correctly if xdp_rxq_info_reg() failed
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netsec: add XDP support · ba2b2321

Committed by Ilias Apalodimas on Jun 29, 2019

The interface only supports 1 Tx queue, so locking is introduced on
the Tx queue if XDP is enabled to make sure .ndo_start_xmit and
.ndo_xdp_xmit won't corrupt the Tx ring.

- Performance (SMMU off)

Benchmark   XDP_SKB     XDP_DRV
xdp1        291kpps     344kpps
rxdrop      282kpps     342kpps

- Performance (SMMU on)

Benchmark   XDP_SKB     XDP_DRV
xdp1        167kpps     324kpps
rxdrop      164kpps     323kpps
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: page_pool: add helper function for retrieving dma direction · bb005f2a

Committed by Ilias Apalodimas on Jun 29, 2019

Since the DMA direction is stored in the page pool params, offer an API
helper for drivers that choose not to keep track of it locally.
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netsec: Use page_pool API · 5c67bf0e

Committed by Ilias Apalodimas on Jun 29, 2019

Use page_pool and its DMA mapping capabilities for Rx buffers instead
of netdev/napi_alloc_frag().

Although this will result in a slight performance penalty on small sized
packets (~10%), the use of the API will allow us to easily add XDP support.
The penalty won't be visible in network testing, i.e. iperf/netperf etc.; it
only happens during raw packet drops.
Furthermore, we intend to add recycling capabilities to the API
in the future. Once recycling is added, the performance penalty will
go away.
The only 'real' penalty is the slightly increased memory usage, since we
now allocate a page per packet instead of the amount of bytes we need +
skb metadata (difference is roughly 2kb per packet).
With a minimum of 4GB of RAM on the only SoC that has this NIC, the
extra memory usage is negligible (a bit more on 64K pages).
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: added tdc tests for prio qdisc · a8488b70

Committed by Roman Mashak on Jun 28, 2019

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mirred-batch-fixes' · c8881faf

Committed by David S. Miller on Jul 01, 2019

Roman Mashak says:

====================
Fix batched event generation for mirred action

When adding or deleting a batch of entries, the kernel sends up to
TCA_ACT_MAX_PRIO entries in an event to user space. However, it does not
consider that the action sizes may vary and require different skb sizes.

For example :

% cat tc-batch.sh
TC="sudo /mnt/iproute2.git/tc/tc"

$TC actions flush action mirred
for i in `seq 1 $1`;
do
   cmd="action mirred egress redirect dev lo index $i "
   args=$args$cmd
done
$TC actions add $args
%
% ./tc-batch.sh 32
Error: Failed to fill netlink attributes while adding TC action.
We have an error talking to the kernel
%

patch 1 adds callback in tc_action_ops of mirred action, which calculates
the action size, and passes size to tcf_add_notify()/tcf_del_notify().

patch 2 updates the TDC test suite with relevant test cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: updated mirred action tests with batch create/delete · 5d15a8ec

Committed by Roman Mashak on Jun 28, 2019

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openeuler / Kernel · synced successfully over a year ago