提交 · cc69837fcaf467426ca19e5790085c26146a2300 · openeuler / Kernel

24 11月, 2020 2 次提交

net: don't include ethtool.h from netdevice.h · cc69837f

由 Jakub Kicinski 提交于 11月 20, 2020

linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.
Acked-by: NJohannes Berg <johannes@sipsolutions.net>
Acked-by: NShannon Nelson <snelson@pensando.io>
Reviewed-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

cc69837f

net: dsa: tag_hellcreek: Cleanup includes · 8551fad6

由 Kurt Kanzenbach 提交于 11月 21, 2020

Remove unused and add needed includes. No functional change.
Suggested-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NKurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

8551fad6

22 11月, 2020 1 次提交

net: bridge: switch to net core statistics counters handling · 7609ecb2

由 Heiner Kallweit 提交于 11月 20, 2020

Use netdev->tstats instead of a member of net_bridge for storing
a pointer to the per-cpu counters. This allows us to use core
functionality for statistics handling.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/9bad2be2-fd84-7c6e-912f-cee433787018@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

7609ecb2

21 11月, 2020 11 次提交

mptcp: refine MPTCP-level ack scheduling · ea4ca586

由 Paolo Abeni 提交于 11月 19, 2020

Send timely MPTCP-level ack is somewhat difficult when
the insertion into the msk receive level is performed
by the worker.

It needs TCP-level dup-ack to notify the MPTCP-level
ack_seq increase, as both the TCP-level ack seq and the
rcv window are unchanged.

We can actually avoid processing incoming data with the
worker, and let the subflow or recevmsg() send ack as needed.

When recvmsg() moves the skbs inside the msk receive queue,
the msk space is still unchanged, so tcp_cleanup_rbuf() could
end-up skipping TCP-level ack generation. Anyway, when
__mptcp_move_skbs() is invoked, a known amount of bytes is
going to be consumed soon: we update rcv wnd computation taking
them in account.

Additionally we need to explicitly trigger tcp_cleanup_rbuf()
when recvmsg() consumes a significant amount of the receive buffer.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ea4ca586

mptcp: track window announced to peer · fa3fe2b1

由 Florian Westphal 提交于 11月 19, 2020

OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.

This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.

Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

fa3fe2b1

mptcp: send out dedicated ADD_ADDR packet · 84dfe367

由 Geliang Tang 提交于 11月 19, 2020

When ADD_ADDR suboption includes an IPv6 address, the size is 28 octets.
It will not fit when other MPTCP suboptions are included in this packet,
e.g. DSS. So here we send out an ADD_ADDR dedicated packet to carry only
ADD_ADDR suboption, no other MPTCP suboptions.

In mptcp_pm_announce_addr, we check whether this is an IPv6 ADD_ADDR.
If it is, we set the flag MPTCP_ADD_ADDR_IPV6 to true. Then we call
mptcp_pm_nl_add_addr_send_ack to sent out a new pure ACK packet.

In mptcp_established_options_add_addr, we check whether this is a pure
ACK packet for ADD_ADDR. If it is, we drop all other MPTCP suboptions
in this packet, only put ADD_ADDR suboption in it.
Suggested-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NGeliang Tang <geliangtang@gmail.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

84dfe367

mptcp: change add_addr_signal type · d91d322a

由 Geliang Tang 提交于 11月 19, 2020

This patch changed the 'add_addr_signal' type from bool to char, so that
we could encode the addr type there.
Suggested-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NGeliang Tang <geliangtang@gmail.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

d91d322a

mptcp: keep unaccepted MPC subflow into join list · 0397c6d8

由 Paolo Abeni 提交于 11月 19, 2020

This will simplify all operation dealing with subflows
before accept time (e.g. data fin processing, add_addr).

The join list is already flushed by mptcp_stream_accept()
before returning the newly created msk to the user space.

This also fixes an potential bug present into the old code:
conn_list was manipulated without helding the msk lock
in mptcp_stream_accept().
Tested-by: NGeliang Tang <geliangtang@gmail.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

0397c6d8

mptcp: skip to next candidate if subflow has unacked data · 860975c6

由 Florian Westphal 提交于 11月 19, 2020

In case a subflow path is blocked, MPTCP-level retransmit may not take
place anymore because such subflow is likely to have unacked data left
in its write queue.

Ignore subflows that have experienced loss and test next candidate.

Fixes: 3b1d6210 ("mptcp: implement and use MPTCP-level retransmission")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

860975c6

mptcp: fix state tracking for fallback socket · 26aa2314

由 Paolo Abeni 提交于 11月 19, 2020

We need to cope with some more state transition for
fallback sockets, or could still end-up moving to TCP_CLOSE
too early and avoid spooling some pending data

Fixes: e16163b6 ("mptcp: refactor shutdown and close")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

26aa2314

mptcp: drop WORKER_RUNNING status bit · b2771d24

由 Paolo Abeni 提交于 11月 19, 2020

Only mptcp_close() can actually cancel the workqueue,
no need to add and use this flag.
Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b2771d24

net: dsa: avoid potential use-after-free error · 30abc9cd

由 Christian Eggers 提交于 11月 19, 2020

If dsa_switch_ops::port_txtstamp() returns false, clone will be freed
immediately. Shouldn't store a pointer to freed memory.
Signed-off-by: NChristian Eggers <ceggers@arri.de>
Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
Tested-by: NVladimir Oltean <olteanv@gmail.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201119110906.25558-1-ceggers@arri.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>

30abc9cd

net: add annotation for sock_{lock,unlock}_fast · 12f4bd86

由 Paolo Abeni 提交于 11月 17, 2020

The static checker is fooled by the non-static locking scheme
implemented by the mentioned helpers.
Let's make its life easier adding some unconditional annotation
so that the helpers are now interpreted as a plain spinlock from
sparse.

v1 -> v2:
 - add __releases() annotation to unlock_sock_fast()
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

12f4bd86

net: openvswitch: Be liberal in tcp conntrack. · e2ef5203

由 Numan Siddique 提交于 11月 16, 2020

There is no easy way to distinguish if a conntracked tcp packet is
marked invalid because of tcp_in_window() check error or because
it doesn't belong to an existing connection. With this patch,
openvswitch sets liberal tcp flag for the established sessions so
that out of window packets are not marked invalid.

A helper function - nf_ct_set_tcp_be_liberal(nf_conn) is added which
sets this flag for both the directions of the nf_conn.
Suggested-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NNuman Siddique <nusiddiq@redhat.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201116130126.3065077-1-nusiddiq@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

e2ef5203

20 11月, 2020 9 次提交

can: gw: support modification of Classical CAN DLCs · 94c23097

由 Oliver Hartkopp 提交于 11月 19, 2020

Add support for data length code modifications for Classical CAN.

The netlink configuration interface always allowed to pass any value
that fits into a byte, therefore only the modification process had to be
extended to handle the raw DLC represenation of Classical CAN frames.

When a DLC value from 0 .. F is provided for Classical CAN frame
modifications the 'len' value is modified as-is with the exception that
potentially existing 9 .. F DLC values in the len8_dlc element are moved
to the 'len' element for the modification operation by mod_retrieve_ccdlc().

After the modification the Classical CAN frame DLC information is brought
back into the correct format by mod_store_ccdlc() which is filling 'len'
and 'len8_dlc' accordingly.
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201119084921.2621-1-socketcan@hartkopp.netSigned-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

94c23097

can: replace can_dlc as variable/element for payload length · c7b74967

由 Oliver Hartkopp 提交于 11月 20, 2020

The naming of can_dlc as element of struct can_frame and also as variable
name is misleading as it claims to be a 'data length CODE' but in reality
it always was a plain data length.

With the indroduction of a new 'len' element in struct can_frame we can now
remove can_dlc as name and make clear which of the former uses was a plain
length (-> 'len') or a data length code (-> 'dlc') value.
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201120100444.3199-1-socketcan@hartkopp.net
[mkl: gs_usb: keep struct gs_host_frame::can_dlc as is]
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

c7b74967

IPv6: RTM_GETROUTE: Add RTA_ENCAP to result · 6b13d8f7

由 Oliver Herms 提交于 11月 19, 2020

This patch adds an IPv6 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(i.e. ip route get 2001:db8::).
Signed-off-by: NOliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201118230651.GA8861@twsSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6b13d8f7

mptcp: update rtx timeout only if required. · b680a214

由 Paolo Abeni 提交于 11月 18, 2020

We must start the retransmission timer only there are
pending data in the rtx queue.
Otherwise we can hit a WARN_ON in mptcp_reset_timer(),
as syzbot demonstrated.

Reported-and-tested-by: syzbot+42aa53dafb66a07e5a24@syzkaller.appspotmail.com
Fixes: d9ca1de8 ("mptcp: move page frag allocation in mptcp_sendmsg()")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/r/1a72039f112cae048c44d398ffa14e0a1432db3d.1605737083.git.pabeni@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b680a214

devlink: move flash end and begin to core devlink · 52cc5f3a

由 Jacob Keller 提交于 11月 18, 2020

When performing a flash update via devlink, device drivers may inform
user space of status updates via
devlink_flash_update_(begin|end|timeout|status)_notify functions.

It is expected that drivers do not send any status notifications unless
they send a begin and end message. If a driver sends a status
notification without sending the appropriate end notification upon
finishing (regardless of success or failure), the current implementation
of the devlink userspace program can get stuck endlessly waiting for the
end notification that will never come.

The current ice driver implementation may send such a status message
without the appropriate end notification in rare cases.

Fixing the ice driver is relatively simple: we just need to send the
begin_notify at the start of the function and always send an end_notify
no matter how the function exits.

Rather than assuming driver authors will always get this right in the
future, lets just fix the API so that it is not possible to get wrong.
Make devlink_flash_update_begin_notify and
devlink_flash_update_end_notify static, and call them in devlink.c core
code. Always send the begin_notify just before calling the driver's
flash_update routine. Always send the end_notify just after the routine
returns regardless of success or failure.

Doing this makes the status notification easier to use from the driver,
as it no longer needs to worry about catching failures and cleaning up
by calling devlink_flash_update_end_notify. It is now no longer possible
to do the wrong thing in this regard. We also save a couple of lines of
code in each driver.
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Acked-by: NVasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

52cc5f3a

devlink: move request_firmware out of driver · b44cfd4f

由 Jacob Keller 提交于 11月 18, 2020

All drivers which implement the devlink flash update support, with the
exception of netdevsim, use either request_firmware or
request_firmware_direct to locate the firmware file. Rather than having
each driver do this separately as part of its .flash_update
implementation, perform the request_firmware within net/core/devlink.c

Replace the file_name parameter in the struct devlink_flash_update_params
with a pointer to the fw object.

Use request_firmware rather than request_firmware_direct. Although most
Linux distributions today do not have the fallback mechanism
implemented, only about half the drivers used the _direct request, as
compared to the generic request_firmware. In the event that
a distribution does support the fallback mechanism, the devlink flash
update ought to be able to use it to provide the firmware contents. For
distributions which do not support the fallback userspace mechanism,
there should be essentially no difference between request_firmware and
request_firmware_direct.
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Acked-by: NShannon Nelson <snelson@pensando.io>
Acked-by: NVasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b44cfd4f

net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid() · 41a0be3f

由 Karsten Graul 提交于 11月 18, 2020

Sparse complaints 3 times about:
net/smc/smc_ib.c:203:52: warning: incorrect type in argument 1 (different address spaces)
net/smc/smc_ib.c:203:52:    expected struct net_device const *dev
net/smc/smc_ib.c:203:52:    got struct net_device [noderef] __rcu *const ndev

Fix that by using the existing and validated ndev variable instead of
accessing attr->ndev directly.

Fixes: 5102eca9 ("net/smc: Use rdma_read_gid_l2_fields to L2 fields")
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

41a0be3f

net/smc: fix matching of existing link groups · 0530bd6e

由 Karsten Graul 提交于 11月 18, 2020

With the multi-subnet support of SMC-Dv2 the match for existing link
groups should not include the vlanid of the network device.
Set ini->smcd_version accordingly before the call to smc_conn_create()
and use this value in smc_conn_create() to skip the vlanid check.

Fixes: 5c21c4cc ("net/smc: determine accepted ISM devices")
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

0530bd6e

ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module · 2d8f6481

由 Georg Kohmann 提交于 11月 19, 2020

IPV6=m
NF_DEFRAG_IPV6=y

ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
`nf_ct_frag6_gather':
net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
`ipv6_frag_thdr_truncated'

Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
dependency is forcing IPV6=y.

Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
is the same solution as used with a similar issues: Referring to
commit 70b095c8 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
module")

Fixes: 9d9e937b ("ipv6/netfilter: Discard first fragment not including all headers")
Reported-by: NRandy Dunlap <rdunlap@infradead.org>
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NGeorg Kohmann <geokohma@cisco.com>
Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

2d8f6481

19 11月, 2020 4 次提交

net: bridge: replace struct br_vlan_stats with pcpu_sw_netstats · 281cc284

由 Heiner Kallweit 提交于 11月 17, 2020

Struct br_vlan_stats duplicates pcpu_sw_netstats (apart from
br_vlan_stats not defining an alignment requirement), therefore
switch to using the latter one.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/04d25c3d-c5f6-3611-6d37-c2f40243dae2@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

281cc284

atm: nicstar: Replace in_interrupt() usage · f2bcc2fa

由 Sebastian Andrzej Siewior 提交于 11月 16, 2020

push_scqe() uses in_interrupt() to figure out if it is allowed to sleep.

The usage of in_interrupt() in drivers is phased out and Linus clearly
requested that code which changes behaviour depending on context should
either be separated or the context be conveyed in an argument passed by the
caller, which usually knows the context.

Aside of that in_interrupt() is not correct as it does not catch preempt
disabled regions which neither can sleep.

ns_send() (the only caller of push_scqe()) has the following callers:

- vcc_sendmsg() used as proto_ops::sendmsg is expected to be invoked in
  preemtible context.
  -> vcc->dev->ops->send() (ns_send())

- atm_vcc::send via atmdev_ops::send either directly (pointer copied by
  atm_init_aal34() or atm_init_aal5()) or via atm_send_aal0().
  This is invoked by drivers (like br2684, clip, pppoatm, ...) which are
  called from net_device_ops::ndo_start_xmit with BH disabled.

Add atmdev_ops::send_bh which is used by callers from BH context
(atm_send_aal*()) and if this callback missing then ::send is used
instead.
Implement this callback in nicstar and use it to replace in_interrupt().

Cc: Chas Williams <3chas3@gmail.com>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

f2bcc2fa

net: Have netpoll bring-up DSA management interface · 1532b977

由 Florian Fainelli 提交于 11月 16, 2020

DSA network devices rely on having their DSA management interface up and
running otherwise their ndo_open() will return -ENETDOWN. Without doing
this it would not be possible to use DSA devices as netconsole when
configured on the command line. These devices also do not utilize the
upper/lower linking so the check about the netpoll device having upper
is not going to be a problem.

The solution adopted here is identical to the one done for
net/ipv4/ipconfig.c with 728c0208 ("net: ipv4: handle DSA enabled
master network devices"), with the network namespace scope being
restricted to that of the process configuring netpoll.

Fixes: 04ff53f9 ("net: dsa: Add netconsole support")
Tested-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201117035236.22658-1-f.fainelli@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

1532b977

ah6: fix error return code in ah6_input() · a5ebcbdf

由 Zhang Changzhong 提交于 11月 17, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NZhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a5ebcbdf

18 11月, 2020 13 次提交

ipv4: use IS_ENABLED instead of ifdef · c09c8a27

由 Florian Klink 提交于 11月 15, 2020

Checking for ifdef CONFIG_x fails if CONFIG_x=m.

Use IS_ENABLED instead, which is true for both built-ins and modules.

Otherwise, a
> ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
fails with the message "Error: IPv6 support not enabled in kernel." if
CONFIG_IPV6 is `m`.

In the spirit of b8127113.

Fixes: d1566268 ("ipv4: Allow ipv6 gateway with ipv4 routes")
Cc: Kim Phillips <kim.phillips@arm.com>
Signed-off-by: NFlorian Klink <flokli@flokli.de>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>

c09c8a27

inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill() · e33de7c5

由 Wang Hai 提交于 11月 16, 2020

nlmsg_cancel() needs to be called in the error path of
inet_req_diag_fill to cancel the message.

Fixes: d545caca ("net: inet: diag: expose the socket mark to privileged processes.")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NWang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

e33de7c5

bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list · 4363023d

由 John Fastabend 提交于 11月 16, 2020

When skb has a frag_list its possible for skb_to_sgvec() to fail. This
happens when the scatterlist has fewer elements to store pages than would
be needed for the initial skb plus any of its frags.

This case appears rare, but is possible when running an RX parser/verdict
programs exposed to the internet. Currently, when this happens we throw
an error, break the pipe, and kfree the msg. This effectively breaks the
application or forces it to do a retry.

Lets catch this case and handle it by doing an skb_linearize() on any
skb we receive with frags. At this point skb_to_sgvec should not fail
because the failing conditions would require frags to be in place.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370

4363023d

bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self · 2443ca66

由 John Fastabend 提交于 11月 16, 2020

If the skb_verdict_prog redirects an skb knowingly to itself, fix your
BPF program this is not optimal and an abuse of the API please use
SK_PASS. That said there may be cases, such as socket load balancing,
where picking the socket is hashed based or otherwise picks the same
socket it was received on in some rare cases. If this happens we don't
want to confuse userspace giving them an EAGAIN error if we can avoid
it.

To avoid double accounting in these cases. At the moment even if the
skb has already been charged against the sockets rcvbuf and forward
alloc we check it again and do set_owner_r() causing it to be orphaned
and recharged. For one this is useless work, but more importantly we
can have a case where the skb could be put on the ingress queue, but
because we are under memory pressure we return EAGAIN. The trouble
here is the skb has already been accounted for so any rcvbuf checks
include the memory associated with the packet already. This rolls
up and can result in unnecessary EAGAIN errors in userspace read()
calls.

Fix by doing an unlikely check and skipping checks if skb->sk == sk.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370

2443ca66

bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self · 6fa9201a

由 John Fastabend 提交于 11月 16, 2020

If a socket redirects to itself and it is under memory pressure it is
possible to get a socket stuck so that recv() returns EAGAIN and the
socket can not advance for some time. This happens because when
redirecting a skb to the same socket we received the skb on we first
check if it is OK to enqueue the skb on the receiving socket by checking
memory limits. But, if the skb is itself the object holding the memory
needed to enqueue the skb we will keep retrying from kernel side
and always fail with EAGAIN. Then userspace will get a recv() EAGAIN
error if there are no skbs in the psock ingress queue. This will continue
until either some skbs get kfree'd causing the memory pressure to
reduce far enough that we can enqueue the pending packet or the
socket is destroyed. In some cases its possible to get a socket
stuck for a noticeable amount of time if the socket is only receiving
skbs from sk_skb verdict programs. To reproduce I make the socket
memory limits ridiculously low so sockets are always under memory
pressure. More often though if under memory pressure it looks like
a spurious EAGAIN error on user space side causing userspace to retry
and typically enough has moved on the memory side that it works.

To fix skip memory checks and skb_orphan if receiving on the same
sock as already assigned.

For SK_PASS cases this is easy, its always the same socket so we
can just omit the orphan/set_owner pair.

For backlog cases we need to check skb->sk and decide if the orphan
and set_owner pair are needed.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370

6fa9201a

bpf, sockmap: Use truesize with sk_rmem_schedule() · 70796fb7

由 John Fastabend 提交于 11月 16, 2020

We use skb->size with sk_rmem_scheduled() which is not correct. Instead
use truesize to align with socket and tcp stack usage of sk_rmem_schedule.
Suggested-by: NDaniel Borkman <daniel@iogearbox.net>
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370

70796fb7

bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect · 36cd0e69

由 John Fastabend 提交于 11月 16, 2020

Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.

We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: NJakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370

36cd0e69

bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made · c9c89dcd

由 John Fastabend 提交于 11月 16, 2020

If copy_page_to_iter() fails or even partially completes, but with fewer
bytes copied than expected we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before we return an error. Because we leave the copied data in the user
buffer and fail to unwind the scatterlist so kernel side believes data
has been copied and user side believes data has _not_ been received.

Expected behavior should be to return number of bytes copied and then
on the next read we need to return the error assuming its still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.

The error is rare enough though that my normal testing with server side
programs, such as nginx, httpd, envoy, etc., I have never seen this. The
only reliable way to reproduce that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.

When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.

The other way to try and reproduce this is run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370

c9c89dcd

net/tls: Fix wrong record sn in async mode of device resync · 138559b9

由 Tariq Toukan 提交于 11月 15, 2020

In async_resync mode, we log the TCP seq of records until the async request
is completed. Later, in case one of the logged seqs matches the resync
request, we return it, together with its record serial number. Before this
fix, we mistakenly returned the serial number of the current record
instead.

Fixes: ed9b7646 ("net/tls: Add asynchronous resync")
Signed-off-by: NTariq Toukan <tariqt@nvidia.com>
Reviewed-by: NBoris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201115131448.2702-1-tariqt@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

138559b9

net: datagram: fix some kernel-doc markups · c1639be9

由 Mauro Carvalho Chehab 提交于 11月 16, 2020

Some identifiers have different names between their prototypes
and the kernel-doc markup.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

c1639be9

net: wan: Delete the DLCI / SDLA drivers · f7365919

由 Xie He 提交于 11月 14, 2020

The DLCI driver (dlci.c) implements the Frame Relay protocol. However,
we already have another newer and better implementation of Frame Relay
provided by the HDLC_FR driver (hdlc_fr.c).

The DLCI driver's implementation of Frame Relay is used by only one
hardware driver in the kernel - the SDLA driver (sdla.c).

The SDLA driver provides Frame Relay support for the Sangoma S50x devices.
However, the vendor provides their own driver (along with their own
multi-WAN-protocol implementations including Frame Relay), called WANPIPE.
I believe most users of the hardware would use the vendor-provided WANPIPE
driver instead.

(The WANPIPE driver was even once in the kernel, but was deleted in
commit 8db60bcf ("[WAN]: Remove broken and unmaintained Sangoma
drivers.") because the vendor no longer updated the in-kernel WANPIPE
driver.)

Cc: Mike McLagan <mike.mclagan@linux.org>
Signed-off-by: NXie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201114150921.685594-1-xie.he.0141@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

f7365919

tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate · 1b9e2a8c

由 Ryan Sharpelletti 提交于 11月 16, 2020

During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond
granularity. BBR is designed using a microsecond granularity. As a
result, multiple RTT samples could be truncated to the same RTT value
during loss recovery. This is problematic, as BBR will not enter
PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning
that if there are persistent losses, PROBE_RTT will constantly be
pushed off and potentially never re-entered. This patch makes sure
that BBR enters PROBE_RTT by checking if RTT sample is < the current
min_rtt sample, rather than <=.

The Netflix transport/TCP team discovered this bug in the Linux TCP
BBR code during lab tests.

Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
Signed-off-by: NRyan Sharpelletti <sharpelletti@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

1b9e2a8c

net: dsa: tag_dsa: Use a consistent comment style · 13f49b6f

由 Tobias Waldekranz 提交于 11月 15, 2020

Use a consistent style of one-line/multi-line comments throughout the
file.
Signed-off-by: NTobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

13f49b6f

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功