Commits · 5473a7bdad78f2efe8ee508e8c7bbb762896e78f · openeuler / Kernel

30 Jan, 2019 6 commits

devlink: Add support for driverinit set value for devlink_port · 5473a7bd

Committed by Vasundhara Volam on Jan 28, 2019

Add support for "driverinit" configuration mode value for devlink_port
configuration parameters. Add devlink_port_param_driverinit_value_set()
function to help the driver set the value to devlink_port.

Also, move the common code to __devlink_param_driverinit_value_set()
to be used by both device and port params.

v7->v8:
Re-order the definitions as follows:
__devlink_param_driverinit_value_get
__devlink_param_driverinit_value_set
devlink_param_driverinit_value_get
devlink_param_driverinit_value_set
devlink_port_param_driverinit_value_get
devlink_port_param_driverinit_value_set

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add support for driverinit get value for devlink_port · ffd19b9a

Committed by Vasundhara Volam on Jan 28, 2019

Add support for "driverinit" configuration mode value for devlink_port
configuration parameters. Add devlink_port_param_driverinit_value_get()
function to help the driver get the value from devlink_port.

Also, move the common code to __devlink_param_driverinit_value_get()
to be used by both device and port params.

v7->v8:
-Add the missing devlink_port_param_driverinit_value_get() declaration.
-Also, order devlink_port_param_driverinit_value_get() after
devlink_param_driverinit_value_get/set() calls

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add port param set command · 9c54873b

Committed by Vasundhara Volam on Jan 28, 2019

Add port param set command to set the value for a parameter.
Value can be set to any of the supported configuration modes.

v7->v8: Append "Acked-by: Jiri Pirko <jiri@mellanox.com>"

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add port param get command · f4601dee

Committed by Vasundhara Volam on Jan 28, 2019

Add port param get command which gets data per parameter.
It also has an option to dump the parameters data per port.

v7->v8: Append "Acked-by: Jiri Pirko <jiri@mellanox.com>"

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add devlink_param for port register and unregister · 39e6160e

Committed by Vasundhara Volam on Jan 28, 2019

Add functions to register and unregister for the driver supported
configuration parameters table per port.

v7->v8:
- Order the definitions in the following way, as suggested by Jiri:
__devlink_params_register
__devlink_params_unregister
devlink_params_register
devlink_params_unregister
devlink_port_params_register
devlink_port_params_unregister
- Append with Acked-by: Jiri Pirko <jiri@mellanox.com>.

v2->v3:
- Add a helper __devlink_params_register() with common code used by
  both devlink_params_register() and devlink_port_params_register().

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: set default network namespace in init_dummy_netdev() · 35edfdc7

Committed by Josh Elsasser on Jan 26, 2019

Assign a default net namespace to netdevs created by init_dummy_netdev().
Fixes a NULL pointer dereference caused by busy-polling a socket bound to
an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat
if napi_poll() received packets:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
  IP: napi_busy_loop+0xd6/0x200
  Call Trace:
    sock_poll+0x5e/0x80
    do_sys_poll+0x324/0x5a0
    SyS_poll+0x6c/0xf0
    do_syscall_64+0x6b/0x1f0
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 7db6b048 ("net: Commonize busy polling code to focus on napi_id instead of socket")
Signed-off-by: Josh Elsasser <jelsasser@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29 Jan, 2019 2 commits

bpf: add BPF_PROG_TEST_RUN support for flow dissector · b7a1848e

Committed by Stanislav Fomichev on Jan 28, 2019

The input is packet data, the output is struct bpf_flow_key. This should
make it easy to test flow dissector programs without elaborate
setup.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/flow_dissector: move bpf case into __skb_flow_bpf_dissect · c8aa7038

Committed by Stanislav Fomichev on Jan 28, 2019

This way, we can reuse it for flow dissector in BPF_PROG_TEST_RUN.

No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

26 Jan, 2019 1 commit

net: Revert devlink health changes. · 30e5c2c6

Committed by David S. Miller on Jan 25, 2019

This reverts the devlink health changes from 1/17/2019.
Jiri wants things to be designed differently, and it was
agreed that the easiest way to do this is to start from the
beginning again.

Commits reverted:

cb5ccfbe
880ee82f
c7af343b
ff253fed
6f9d5613
fcd852c6
8a66704a
12bd0dce
aba25279
ce019faa
b8c45a03

And the follow-on build fix:

33a0efa4

Signed-off-by: David S. Miller <davem@davemloft.net>

24 Jan, 2019 1 commit

bpf: allow BPF programs access skb_shared_info->gso_segs field · d9ff286a

Committed by Eric Dumazet on Jan 23, 2019

This adds the ability to read gso_segs from a BPF program.

v3: Use BPF_REG_AX instead of BPF_REG_TMP for the temporary register,
    as suggested by Martin.

v2: Refined Eddie Hao's patch to address Alexei's feedback.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eddie Hao <eddieh@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

23 Jan, 2019 3 commits

devlink: Use DIV_ROUND_UP_ULL in DEVLINK_HEALTH_SIZE_TO_BUFFERS · 33a0efa4

Committed by Nathan Chancellor on Jan 21, 2019

When building this code on a 32-bit platform such as ARM, there is a
link time error (lld error shown, happens with ld.bfd too):

ld.lld: error: undefined symbol: __aeabi_uldivmod
>>> referenced by devlink.c
>>>               net/core/devlink.o:(devlink_health_buffers_create) in archive built-in.a

This happens when using the regular division operator with a u64 dividend.
Use DIV_ROUND_UP_ULL, which wraps do_div, to avoid this situation.
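
The arithmetic involved can be illustrated in user space. This is only a sketch of rounding-up division on a 64-bit dividend, not the kernel macro itself; the name div_round_up_u64 is hypothetical, and the plain `/` used here is exactly what the kernel has to route through do_div() on 32-bit builds.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch only; div_round_up_u64 is a hypothetical name,
 * not a kernel symbol. Computing quotient + (remainder != 0) avoids
 * the overflow that (n + d - 1) / d would hit near UINT64_MAX. */
uint64_t div_round_up_u64(uint64_t n, uint32_t d)
{
	assert(d != 0);			/* division by zero is undefined */
	return n / d + (n % d != 0);	/* add 1 only when there is a remainder */
}
```

On a 32-bit target the compiler lowers the 64-bit `/` above to a library helper such as __aeabi_uldivmod, which the kernel does not link against; that is the undefined symbol reported in the error.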

Fixes: cb5ccfbe ("devlink: Add health buffer support")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add missing check of nlmsg_put · ed175d9c

Committed by YueHaibing on Jan 21, 2019

nlmsg_put() may fail; this fix adds a check of its return value.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce a knob to control whether to inherit devconf config · 856c395c

Committed by Cong Wang on Jan 17, 2019

There have been many people complaining about the inconsistent
behaviors of IPv4 and IPv6 devconf when creating new network
namespaces.  Currently, for IPv4, we inherit all current settings
from init_net, but for IPv6 we reset all settings to default.

This patch introduces a new /proc file,
/proc/sys/net/core/devconf_inherit_init_net, to control
whether the current sysctl settings are inherited from init_net.
This file itself is only available in init_net.
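
A program could set the knob roughly as sketched below. The helper name set_devconf_inherit is hypothetical; the path and the accepted range 0..2 are taken from this commit message, and the range check simply mirrors the "Invalid argument" error the kernel returns for other values.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define DEVCONF_INHERIT_PATH "/proc/sys/net/core/devconf_inherit_init_net"

/* Hypothetical helper: write one of the documented values (0, 1 or 2)
 * to the knob at 'path'. Returns 0 on success or a negative errno. */
int set_devconf_inherit(const char *path, int value)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);
	if (value < 0 || value > 2)
		return -EINVAL;		/* same error the kernel reports */

	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", value) < 0)
		err = -EIO;
	if (fclose(f) != 0 && !err)
		err = -errno;		/* the write may only fail at flush time */
	return err;
}
```

Calling set_devconf_inherit(DEVCONF_INHERIT_PATH, 1) from init_net, with sufficient privileges, would correspond to the `echo 1` step in the demonstration.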

As demonstrated below:

Initial setup in init_net:
 # cat /proc/sys/net/ipv4/conf/all/rp_filter
 2
 # cat /proc/sys/net/ipv6/conf/all/accept_dad
 1

Default value 0 (current behavior):
 # ip netns del test
 # ip netns add test
 # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
 2
 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
 0

Set to 1 (inherit from init_net):
 # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
 # ip netns del test
 # ip netns add test
 # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
 2
 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
 1

Set to 2 (reset to default):
 # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
 # ip netns del test
 # ip netns add test
 # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
 0
 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
 0

Set to a value out of range (invalid):
 # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
 -bash: echo: write error: Invalid argument
 # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
 -bash: echo: write error: Invalid argument

Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20 Jan, 2019 6 commits

bpf: in __bpf_redirect_no_mac pull mac only if present · e7c87bd6

Committed by Willem de Bruijn on Jan 15, 2019

Syzkaller was able to construct a packet of negative length by
redirecting from bpf_prog_test_run_skb with BPF_PROG_TYPE_LWT_XMIT:

    BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:345 [inline]
    BUG: KASAN: slab-out-of-bounds in skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
    BUG: KASAN: slab-out-of-bounds in __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
    Read of size 4294967282 at addr ffff8801d798009c by task syz-executor2/12942

    kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
    check_memory_region_inline mm/kasan/kasan.c:260 [inline]
    check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
    memcpy+0x23/0x50 mm/kasan/kasan.c:302
    memcpy include/linux/string.h:345 [inline]
    skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
    __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
    __pskb_copy include/linux/skbuff.h:1053 [inline]
    pskb_copy include/linux/skbuff.h:2904 [inline]
    skb_realloc_headroom+0xe7/0x120 net/core/skbuff.c:1539
    ipip6_tunnel_xmit net/ipv6/sit.c:965 [inline]
    sit_tunnel_xmit+0xe1b/0x30d0 net/ipv6/sit.c:1029
    __netdev_start_xmit include/linux/netdevice.h:4325 [inline]
    netdev_start_xmit include/linux/netdevice.h:4334 [inline]
    xmit_one net/core/dev.c:3219 [inline]
    dev_hard_start_xmit+0x295/0xc90 net/core/dev.c:3235
    __dev_queue_xmit+0x2f0d/0x3950 net/core/dev.c:3805
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
    __bpf_tx_skb net/core/filter.c:2016 [inline]
    __bpf_redirect_common net/core/filter.c:2054 [inline]
    __bpf_redirect+0x5cf/0xb20 net/core/filter.c:2061
    ____bpf_clone_redirect net/core/filter.c:2094 [inline]
    bpf_clone_redirect+0x2f6/0x490 net/core/filter.c:2066
    bpf_prog_41f2bcae09cd4ac3+0xb25/0x1000

The generated test constructs a packet with mac header, network
header, skb->data pointing to network header and skb->len 0.

Redirecting to a sit0 through __bpf_redirect_no_mac pulls the
mac length, even though skb->data already is at skb->network_header.
bpf_prog_test_run_skb has already pulled it as LWT_XMIT !is_l2.

Update the offset calculation to pull only if skb->data differs
from skb->network_header, which is not true in this case.
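
The idea of the updated offset calculation can be sketched in user space as follows. Everything here is hypothetical (struct pkt_sketch, its fields and pull_to_network_header are stand-ins, not kernel code): the pull length is derived from the distance between the data pointer and the network header, so a packet that already starts at the network header is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct pkt_sketch {
	size_t data_off;	/* offset of the current data pointer */
	size_t network_off;	/* offset of the network header */
	size_t len;		/* bytes remaining from data_off */
};

/* Pull up to the network header only if data is not already there.
 * Returns the number of bytes pulled. */
size_t pull_to_network_header(struct pkt_sketch *p)
{
	size_t mlen;

	assert(p->network_off >= p->data_off);
	mlen = p->network_off - p->data_off;
	if (mlen) {
		assert(mlen <= p->len);	/* never pull more than is present */
		p->data_off += mlen;
		p->len -= mlen;
	}
	return mlen;
}
```

With the packet from the report (data already at the network header, length 0) nothing is pulled, so the length can no longer underflow.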

The test itself can be run only from commit 1cf1cae9 ("bpf:
introduce BPF_PROG_TEST_RUN command"), but the same type of packets
with skb at network header could already be built from lwt xmit hooks,
so this fix is more relevant to that commit.

Also set the mac header on redirect from LWT_XMIT, as even after this
change to __bpf_redirect_no_mac that field is expected to be set, but
is not yet in ip_finish_output2.

Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: sock: do not set sk_cookie in sk_clone_lock() · 0726f558

Committed by Yafang Shao on Jan 18, 2019

The only call site of sk_clone_lock is in inet_csk_clone_lock,
and sk_cookie will be set there.
So we don't need to set sk_cookie in sk_clone_lock().

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: namespace: perform strict checks also for doit handlers · 4d165f61

Committed by Jakub Kicinski on Jan 18, 2019

Make RTM_GETNSID's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.
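
For context, user space opts in to strict checking per socket. A minimal sketch, assuming a Linux system where the NETLINK_GET_STRICT_CHK socket option is available; the wrapper name enable_strict_checks is hypothetical, and the fallback defines are only there for older headers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270			/* fallback for older libc headers */
#endif
#ifndef NETLINK_GET_STRICT_CHK
#define NETLINK_GET_STRICT_CHK 12	/* fallback for older kernel headers */
#endif

/* Hypothetical wrapper: ask the kernel to apply strict checks to
 * requests sent on this netlink socket.
 * Returns 0 on success or a negative errno. */
int enable_strict_checks(int fd)
{
	int one = 1;

	if (setsockopt(fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
		       &one, sizeof(one)) < 0) {
		int err = errno;

		assert(err > 0);	/* setsockopt() sets errno on failure */
		return -err;
	}
	return 0;
}
```

The wrapper would typically be called once, right after creating the socket with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE).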

v2: - don't check size >= sizeof(struct rtgenmsg) (Nicolas).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: ifinfo: perform strict checks also for doit handler · 9b3757b0

Committed by Jakub Kicinski on Jan 18, 2019

Make RTM_GETLINK's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: stats: reject requests for unknown stats · 6300acb2

Committed by Jakub Kicinski on Jan 18, 2019

In the spirit of strict checks, reject requests for stats the kernel
does not support when NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: stats: validate attributes in get as well as dumps · 51bc860d

Committed by Jakub Kicinski on Jan 18, 2019

Make sure NETLINK_GET_STRICT_CHK influences the GETSTATS doit
as well as the dump.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Jan, 2019 8 commits

devlink: Add health dump {get,clear} commands · 12bd0dce

Committed by Eran Ben Elisha on Jan 17, 2019

Add devlink health dump commands, in order to run a dump operation
over a specific reporter.

The supported operations are dump_get, to get the last saved
dump (if none exists, dump now), and dump_clear, to clear the last saved
dump.

The driver's callback for the diagnose command is expected to fill it
via the buffer descriptors API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health diagnose command · 8a66704a

Committed by Eran Ben Elisha on Jan 17, 2019

Add devlink health diagnose command, in order to run a diagnose
operation over a specific reporter.

The driver's callback for the diagnose command is expected to fill it
via the buffer descriptors API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health recover command · fcd852c6

Committed by Eran Ben Elisha on Jan 17, 2019

Add devlink health recover command to the uapi, in order to allow the user
to execute a recover operation over a specific reporter.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health set command · 6f9d5613

Committed by Eran Ben Elisha on Jan 17, 2019

Add devlink health set command, in order to set configuration parameters
for a specific reporter.
Supported parameters are:
- graceful_period: Time interval between auto recoveries (in msec)
- auto_recover: Determines whether devlink shall execute recover upon
  receiving an error for the reporter

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health get command · ff253fed

Committed by Eran Ben Elisha on Jan 17, 2019

Add devlink health get command to provide reporter/s data for user space.
Add the ability to get data per reporter or dump data from all available
reporters.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health report functionality · c7af343b

Committed by Eran Ben Elisha on Jan 17, 2019

Upon error discovery, every driver can report it to the devlink health
mechanism via the devlink_health_report function, using the appropriate
reporter registered to it. The driver can pass error-specific context which
will be delivered to it as part of the dump / recovery callbacks.

Once an error is reported, devlink health will perform the following actions:
* A log is sent to the kernel trace events buffer
* Health status and statistics are updated for the reporter instance
* An object dump is taken and stored at the reporter instance (as long
  as there is no other dump which is already stored)
* An auto recovery attempt is made, depending on:
  - Auto Recovery configuration
  - Grace period vs. time since last recover

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health reporter create/destroy functionality · 880ee82f

Committed by Eran Ben Elisha on Jan 17, 2019

Devlink health reporter is an instance for reporting, diagnosing and
recovering from run time errors discovered by the reporters.
Define its data structure and supported operations.
In addition, expose devlink API to create and destroy a reporter.
Each devlink instance will hold its own reporters list.

As part of the allocation, the driver shall provide a set of callbacks which
will be used by devlink in order to handle health reports and user
commands related to this reporter. In addition, driver is entitled to
provide some priv pointer, which can be fetched from the reporter by
devlink_health_reporter_priv function.

For each reporter, devlink will hold metadata consisting of statistics,
buffers and status.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add health buffer support · cb5ccfbe

Committed by Eran Ben Elisha on Jan 17, 2019

Devlink health buffer is a mechanism to pass descriptors between drivers
and devlink. The API allows the driver to add objects, object pair,
value array (nested attributes), value and name.

Driver can use this API to fill the buffers in a format which can be
translated by the devlink to the netlink message.

In order to fulfill it, an internal buffer descriptor is defined. This
will hold the data and metadata per each attribute and be used to pass
actual commands to the netlink.

This mechanism will later be used in devlink health to store the dump and
diagnose data provided by the drivers.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Jan, 2019 7 commits

neighbour: Do not perturb drop profiles when neigh_probe · 87fff3ca

Committed by Yang Wei on Jan 17, 2019

Replace kfree_skb() with consume_skb() to be drop monitor (dropwatch,
perf) friendly.

Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add a route cache full diagnostic message · 22c2ad61

Committed by Peter Oskolkov on Jan 16, 2019

In some testing scenarios, dst/route cache can fill up so quickly
that even an explicit GC call occasionally fails to clean it up. This leads
to sporadically failing calls to dst_alloc and "network unreachable" errors
to the user, which is confusing.

This patch adds a diagnostic message to make the cause of the failure
easier to determine.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: fix SO_MAX_PACING_RATE to support TCP internal pacing · e224c390

Committed by Yuchung Cheng on Jan 17, 2019

If the sch_fq packet scheduler is not used, TCP can fall back to
internal pacing, but this requires sk_pacing_status to
be properly set.

Fixes: 8c4b4c7e ("bpf: Add setsockopt helper function to bpf")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: bpf_setsockopt: reset sock dst on SO_MARK changes · f4924f24

Committed by Peter Oskolkov on Jan 16, 2019

In sock_setsockopt() (net/core/sock.c), when the SO_MARK option is used
to change sk_mark, sk_dst_reset(sk) is called. The same should be
done in bpf_setsockopt().

Fixes: 8c4b4c7e ("bpf: Add setsockopt helper function to bpf")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: Add extack argument to ndo_fdb_add() · 87b0984e

Committed by Petr Machata on Jan 16, 2019

Drivers may not be able to support certain FDB entries, and an error
code is insufficient to give clear hints as to the reasons of rejection.

In order to make it possible to communicate the rejection reason, extend
ndo_fdb_add() with an extack argument. Adapt the existing
implementations of ndo_fdb_add() to take the parameter (and ignore it).
Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce SO_BINDTOIFINDEX sockopt · f5dd3d0c

Committed by David Herrmann on Jan 15, 2019

This introduces a new generic SOL_SOCKET-level socket option called
SO_BINDTOIFINDEX. It behaves similarly to SO_BINDTODEVICE, but takes a
network interface index as argument, rather than the network interface
name.
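
From user space the option could be used roughly as sketched below. The wrapper name bind_to_ifindex is hypothetical; the fallback define uses the asm-generic value and is only meant for older libc headers (some architectures use a different number).

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_BINDTOIFINDEX
#define SO_BINDTOIFINDEX 62	/* asm-generic value; some architectures differ */
#endif

/* Hypothetical wrapper: bind a socket to an interface by index
 * instead of by name. Returns 0 on success or a negative errno. */
int bind_to_ifindex(int fd, int ifindex)
{
	assert(ifindex >= 0);	/* negative indexes are never valid */
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTOIFINDEX,
		       &ifindex, sizeof(ifindex)) < 0)
		return -errno;
	return 0;
}
```

A caller that already holds the index (for example from a netlink notification) passes it directly; no name lookup is involved, so a concurrent rename cannot redirect the binding.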

User-space often refers to network-interfaces via their index, but has
to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
This might pose problems when the network-device is renamed
asynchronously by other parts of the system. When this happens, the
SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
device.

In most cases user-space only ever operates on devices which they
either manage themselves, or otherwise have a guarantee that the device
name will not change (e.g., devices that are UP cannot be renamed).
However, particularly in libraries this guarantee is non-obvious and it
would be nice if that race-condition would simply not exist. It would
make it easier for those libraries to operate even in situations where
the device-name might change under the hood.

A real use-case that we recently hit is trying to start the network
stack early in the initrd but make it survive into the real system.
Existing distributions rename network-interfaces during the transition
from initrd into the real system. This, obviously, cannot affect
devices that are up and running (unless you also consider moving them
between network-namespaces). However, the network manager now has to
make sure its management engine for dormant devices will not run in
parallel to these renames. Particularly, when you offload operations
like DHCP into separate processes, these might set up their sockets
early, and thus have to resolve the device-name, possibly running into
this race-condition.

By avoiding a call to resolve the device-name, we no longer depend on
the name and can run network setup of dormant devices in parallel to
the transition off the initrd. The SO_BINDTOIFINDEX socket option plugs this
race.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Optimize sk_msg_clone() by data merge to end dst sg entry · fda497e5

Committed by Vakul Garg on Jan 16, 2019

Function sk_msg_clone has been modified to merge the data from source sg
entry to destination sg entry if the cloned data resides in same page
and is contiguous to the end entry of destination sk_msg. This improves
kernel tls throughput to the tune of 10%.
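
The merge rule can be sketched in user space as follows. All names here (sg_ent, sg_list, sg_append, MAX_SG) are hypothetical stand-ins rather than the kernel's scatterlist API: a chunk that sits on the same page and starts exactly where the last entry ends extends that entry instead of consuming a new one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SG 4	/* deliberately small, to make exhaustion visible */

struct sg_ent {
	int page;	/* stand-in for a page identity */
	size_t off;	/* offset within the page */
	size_t len;	/* length of the chunk */
};

struct sg_list {
	struct sg_ent ent[MAX_SG];
	size_t n;	/* entries in use */
};

/* Append a chunk, merging it into the last entry when it is on the
 * same page and contiguous with it.
 * Returns 0 on success, -1 if no free entry is left. */
int sg_append(struct sg_list *l, int page, size_t off, size_t len)
{
	assert(l->n <= MAX_SG);
	if (l->n > 0) {
		struct sg_ent *last = &l->ent[l->n - 1];

		if (last->page == page && last->off + last->len == off) {
			last->len += len;	/* merge: no new entry used */
			return 0;
		}
	}
	if (l->n == MAX_SG)
		return -1;
	l->ent[l->n++] = (struct sg_ent){ page, off, len };
	return 0;
}
```

With many small contiguous appends, as produced by repeated sendmsg() calls with MSG_MORE, the list stays at one entry per contiguous run instead of one entry per call.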

When the user space tls application calls sendmsg() with MSG_MORE, it leads
to calling sk_msg_clone() with the new data being cloned placed contiguous to
previously cloned data. Without this optimization, a new SG entry in
the destination sk_msg, i.e. rec->msg_plaintext in tls_clone_plaintext_msg(),
gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
even before a full 16K of allowable record data is accumulated. Hence we
lose the opportunity to encrypt and send a full 16K record.

With this patch, the kernel tls can accumulate full 16K of record data
irrespective of the size of data passed in sendmsg() with MSG_MORE.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jan, 2019 2 commits

bpf: Correctly annotate implicit fall through in bpf_base_func_proto · c61c2768

Committed by Mathieu Malaterre on Jan 16, 2019

There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

To preserve as much of the existing comment as possible, only change a ‘:’ into a ‘,’.
This is enough of a change to match the regular expression expected by GCC.
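
The shape of such an annotation is sketched below. The function classify and its return values are hypothetical and only loosely modelled on the pattern described; the point is the comment placed directly before the next label, which, to the best of my knowledge, GCC accepts in the "else, fall through" form.

```c
#include <assert.h>

/* Hypothetical example: a privileged caller gets a special result
 * for id 1, everyone else shares the default path. */
int classify(int privileged, int func_id)
{
	assert(privileged == 0 || privileged == 1);
	switch (func_id) {
	case 1:
		if (privileged)
			return 100;
		/* else, fall through */
	default:
		return 0;
	}
}
```

Compiling with -Wimplicit-fallthrough and deleting the comment should bring the warning back, which is a quick way to confirm that the annotation is being recognized.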

This commit removes the following warning:

net/core/filter.c:5310:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/core/neighbour: fix kmemleak minimal reference count for hash tables · 01b833ab

Committed by Konstantin Khlebnikov on Jan 14, 2019

This should be 1 for normal allocations; 0 disables leak reporting.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Fixes: 85704cb8 ("net/core/neighbour: tell kmemleak about hash tables")
Signed-off-by: David S. Miller <davem@davemloft.net>

10 Jan, 2019 2 commits

net/core/neighbour: tell kmemleak about hash tables · 85704cb8

Committed by Konstantin Khlebnikov on Jan 08, 2019

This fixes false-positive kmemleak reports about leaked neighbour entries:

unreferenced object 0xffff8885c6e4d0a8 (size 1024):
  comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff  ........ ,......
    08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00  ..._......}.....
  backtrace:
    [<00000000748509fe>] ip6_finish_output2+0x887/0x1e40
    [<0000000036d7a0d8>] ip6_output+0x1ba/0x600
    [<0000000027ea7dba>] ip6_send_skb+0x92/0x2f0
    [<00000000d6e2111d>] udp_v6_send_skb.isra.24+0x680/0x15e0
    [<000000000668a8be>] udpv6_sendmsg+0x18c9/0x27a0
    [<000000004bd5fa90>] sock_sendmsg+0xb3/0xf0
    [<000000008227b29f>] ___sys_sendmsg+0x745/0x8f0
    [<000000008698009d>] __sys_sendmsg+0xde/0x170
    [<00000000889dacf1>] do_syscall_64+0x9b/0x400
    [<0000000081cdb353>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005767ed39>] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: correctly set initial window on active Fast Open sender · 31aa6503

Committed by Yuchung Cheng on Jan 08, 2019

The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
work on an (active) Fast Open sender. This is because it changes the
(initial) window only if data_segs_out is zero -- but data_segs_out
is also incremented on SYN-data.  This patch fixes the issue by
properly accounting for SYN-data additionally.
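
A user-space sketch of the adjusted condition follows. It assumes the check compares data_segs_out against a flag recording that the SYN carried data; the struct tcp_state_sketch and the function set_initial_cwnd are hypothetical stand-ins, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant TCP socket state. */
struct tcp_state_sketch {
	unsigned int data_segs_out;	/* data segments sent so far */
	unsigned int syn_data;		/* 1 if the SYN carried data */
	unsigned int snd_cwnd;		/* congestion window, in segments */
};

/* Allow changing the initial window as long as nothing beyond
 * SYN-data has been sent. Returns 0 or -EINVAL. */
int set_initial_cwnd(struct tcp_state_sketch *tp, int val)
{
	assert(tp != NULL);
	if (val <= 0 || tp->data_segs_out > tp->syn_data)
		return -EINVAL;
	tp->snd_cwnd = (unsigned int)val;
	return 0;
}
```

A plain sender (no data sent yet) and a Fast Open sender (one segment sent, carried by the SYN) are both accepted, while any sender that has transmitted data after the handshake is rejected.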

Fixes: fc747810 ("bpf: Adds support for setting initial cwnd")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

06 Jan, 2019 1 commit

jump_label: move 'asm goto' support test to Kconfig · e9666d10

Committed by Masahiro Yamada on Dec 31, 2018

Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:

  #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
  # define HAVE_JUMP_LABEL
  #endif

We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match the real kernel capability.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>

05 Jan, 2019 1 commit

net, skbuff: do not prefer skb allocation fails early · f8c468e8

Committed by David Rientjes on Jan 02, 2019

Commit dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
directly reclaim.

The previous behavior would require reclaim up to 1 << order pages for
skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
otherwise the allocations in alloc_skb() would loop in the page allocator
looking for memory. __GFP_RETRY_MAYFAIL makes both allocations failable
under memory pressure, including for the HEAD allocation.

This can cause, among many other things, write() to fail with ENOTCONN
during RPC when under memory pressure.

These allocations should succeed as they did previous to dcda9b04
even if it requires calling the oom killer and additional looping in the
page allocator to find memory. There is no way to specify the previous
behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
previous behavior only guaranteed that 1 << order pages would be reclaimed
before failing for order > PAGE_ALLOC_COSTLY_ORDER. That reclaim is not
guaranteed to be contiguous memory, so repeating for such large orders is
usually not beneficial.

Remove the setting of __GFP_RETRY_MAYFAIL to restore the previous
behavior, specifically not allowing alloc_skb() to fail for small orders
and oom kill if necessary rather than allowing RPCs to fail.

Fixes: dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
