Commits · 9adc89af724f12a03b47099cd943ed54e877cd59 · openeuler / Kernel

31 Mar, 2021 · 1 commit

net: let skb_orphan_partial wake-up waiters. · 9adc89af

Committed by Paolo Abeni on Mar 30, 2021

Currently the mentioned helper can end up freeing the socket wmem
without waking up any processes waiting for more write memory.

If the partially orphaned skb is attached to a UDP (or raw) socket,
the lack of wake-up can hang user space.

Even for TCP sockets, not calling the sk destructor could have bad
effects on TSQ.

Address the issue using skb_orphan to release the sk wmem before
setting the new sock_efree destructor. Additionally bundle the
whole ownership update in a new helper, so that later other
potential users could avoid duplicate code.

v1 -> v2:
 - use skb_orphan() instead of sort of open coding it (Eric)
 - provide a helper for the ownership change (Eric)

Fixes: f6ba8d33 ("netem: fix skb_orphan_partial()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Mar, 2021 · 1 commit

net: sock: simplify tw proto registration · b80350f3

Committed by Tonghao Zhang on Mar 11, 2021

Introduce the new function tw_prot_init (inspired by
req_prot_init) to simplify "proto_register" function.

tw_prot_cleanup will take care of a partially initialized
timewait_sock_ops.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Feb, 2021 · 1 commit

net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df

Committed by Brian Vazquez on Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
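
The idea of avoiding an indirect call for the common targets can be illustrated in user space. This is only a sketch: the `*_sketch` functions and `dst_check_dispatch()` are hypothetical names, and the kernel's own indirect-call wrapper macros are not reproduced here. The dispatcher compares the function pointer against the two expected callbacks and calls them directly, falling back to a real indirect call for anything else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type standing in for a dst_check-style hook. */
typedef int (*dst_check_fn)(int cookie);

/* Hypothetical stand-ins for the two common implementations. */
int ipv4_dst_check_sketch(int cookie) { return cookie + 4; }
int ip6_dst_check_sketch(int cookie)  { return cookie + 6; }

/* A third, uncommon implementation that still takes the indirect path. */
int other_dst_check_sketch(int cookie) { return cookie; }

/*
 * Compare the pointer with the expected targets and call them directly,
 * so the compiler can emit a direct (predictable) call for the common case.
 */
int dst_check_dispatch(dst_check_fn check, int cookie)
{
	if (check == ip6_dst_check_sketch)
		return ip6_dst_check_sketch(cookie);
	if (check == ipv4_dst_check_sketch)
		return ipv4_dst_check_sketch(cookie);
	return check(cookie);	/* true indirect call for everything else */
}
```

The result is identical whichever path is taken; only the kind of call instruction differs, which matters mainly when indirect branches are expensive.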
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

30 Jan, 2021 · 1 commit

net: Remove redundant calls of sk_tx_queue_clear(). · df610cd9

Committed by Kuniyuki Iwashima on Jan 29, 2021

The commit 41b14fb8 ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4 ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.

Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it does the only allocation and its callers
initialize fields.
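
A compile-time check of this kind can be sketched in plain C11. Everything below is an illustrative assumption: `struct sock_sketch`, its marker fields and `sock_copy_sketch()` are made-up stand-ins, not the kernel's definitions. The static assertion fails the build if the queue-mapping field is ever moved out of the region that the copy routine skips.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a socket with a "do not copy" region. */
struct sock_sketch {
	int  family;		/* copied on clone */
	char dontcopy_begin;	/* marker: start of the no-copy region */
	int  tx_queue_mapping;	/* must stay inside the no-copy region */
	int  refcnt;
	char dontcopy_end;	/* marker: end of the no-copy region */
	int  priority;		/* copied on clone */
};

/* Build breaks if tx_queue_mapping leaves the no-copy region. */
_Static_assert(offsetof(struct sock_sketch, tx_queue_mapping) >
	       offsetof(struct sock_sketch, dontcopy_begin) &&
	       offsetof(struct sock_sketch, tx_queue_mapping) <
	       offsetof(struct sock_sketch, dontcopy_end),
	       "tx_queue_mapping must live in the no-copy region");

/* Copy everything except the bytes between the two markers. */
void sock_copy_sketch(struct sock_sketch *dst, const struct sock_sketch *src)
{
	size_t begin = offsetof(struct sock_sketch, dontcopy_begin);
	size_t end = offsetof(struct sock_sketch, dontcopy_end);

	memcpy(dst, src, begin);
	memcpy((char *)dst + end, (const char *)src + end, sizeof(*dst) - end);
}
```

With the assertion in place, a caller may initialize the mapping before the copy and rely on the copy leaving it untouched.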

CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

29 Jan, 2021 · 1 commit

net: reduce indentation level in sk_clone_lock() · bbc20b70

Committed by Eric Dumazet on Jan 27, 2021

Rework initial test to jump over init code
if memory allocation has failed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210127152731.748663-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

05 Dec, 2020 · 1 commit

net: Remove the err argument from sock_from_file · dba4a925

Committed by Florent Revest on Dec 04, 2020

Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.

This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.
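
The simplified calling convention can be mocked up in user space. The types and functions below (`file_sketch`, `socket_sketch`, `sock_from_file_sketch()`, `socket_type_of()`) are hypothetical stand-ins rather than the kernel's structures; the sketch only shows a lookup that returns NULL for non-sockets and a caller that derives `-ENOTSOCK` from that NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct socket / struct file. */
struct socket_sketch {
	int type;
};

struct file_sketch {
	int   is_socket;	/* assumption: replaces the real f_op check */
	void *private_data;	/* points at the socket when is_socket is set */
};

/* No err out-parameter: NULL alone signals "not a socket". */
struct socket_sketch *sock_from_file_sketch(struct file_sketch *file)
{
	if (file->is_socket)
		return file->private_data;
	return NULL;
}

/* Example caller deducing the error code from the NULL return. */
int socket_type_of(struct file_sketch *file)
{
	struct socket_sketch *sock = sock_from_file_sketch(file);

	if (!sock)
		return -ENOTSOCK;
	return sock->type;
}
```

Callers that do not care about the error code simply test the pointer and skip the translation step.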
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com

01 Dec, 2020 · 3 commits

mptcp: open code mptcp variant for lock_sock · ad80b0fc

Committed by Paolo Abeni on Nov 27, 2020

This allows invoking an additional callback under the
socket spin lock.

Will be used by the next patches to avoid additional
spin lock contention.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add SO_BUSY_POLL_BUDGET socket option · 7c951caf

Committed by Björn Töpel on Nov 30, 2020

This option lets a user set a per-socket NAPI budget for
busy-polling. If the option is not set, the default of 8 is used.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com

net: Introduce preferred busy-polling · 7fd3253a

Committed by Björn Töpel on Nov 30, 2020

The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.

One implication of the behavior above is that a busy/heavily loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.

If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.

In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.

Example usage:

  $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
  $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.

Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
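
The socket side of the setup might look like the following user-space sketch. `enable_preferred_busy_poll()` is a hypothetical helper, and the fallback option numbers (46, 69, 70) are an assumption taken from the generic socket option numbering; some architectures use different values, and older libc headers may not define the newer names at all. The calls may also fail with EPERM for processes lacking CAP_NET_ADMIN, so the helper reports the first error as a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for older headers; values assume the generic numbering. */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

/*
 * Hypothetical helper: prefer busy-polling on fd, busy-poll for up to
 * 'usecs' microseconds, and use 'budget' as the per-socket NAPI budget.
 * Returns 0 on success or a negative errno from the first failing call.
 */
int enable_preferred_busy_poll(int fd, int usecs, int budget)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)) < 0)
		return -errno;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
		return -errno;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget)) < 0)
		return -errno;
	return 0;
}
```

A caller would typically invoke this once after creating the socket, for example `enable_preferred_busy_poll(fd, 20, 8)`, and treat a negative return as "fall back to regular softirq processing".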
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com

21 Nov, 2020 · 1 commit

net: add annotation for sock_{lock,unlock}_fast · 12f4bd86

Committed by Paolo Abeni on Nov 17, 2020

The static checker is fooled by the non-static locking scheme
implemented by the mentioned helpers.
Let's make its life easier by adding some unconditional annotations
so that the helpers are now interpreted as a plain spinlock by
sparse.

v1 -> v2:
 - add __releases() annotation to unlock_sock_fast()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

23 Oct, 2020 · 1 commit

net: Properly typecast int values to set sk_max_pacing_rate · 700465fd

Committed by Ke Li on Oct 22, 2020

In setsockopt(SO_MAX_PACING_RATE) on 64-bit systems, sk_max_pacing_rate,
after being extended from 'u32' to 'unsigned long', takes an unintentionally
inflated value whenever it is assigned from an 'int' value with MSB=1, due to
binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes
0xFFFFFFFF80000000.

The inflated sk_max_pacing_rate causes a subsequent getsockopt to return
~0U unexpectedly. It may also result in an increased pacing rate.

Fix by explicitly casting the 'int' value to 'unsigned int' before
assigning it to sk_max_pacing_rate, for zero extension to happen.
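
The difference between the two conversions can be reproduced in a few lines of standard C. The function names are illustrative only. Converting a negative `int` straight to `unsigned long` sign-extends it, while going through `unsigned int` first zero-extends it; on an LP64 system the first conversion turns `0x80000000` into `0xFFFFFFFF80000000`, as described above.

```c
#include <assert.h>
#include <limits.h>

/* Direct widening: a negative int is sign-extended. */
unsigned long widen_sign_extended(int val)
{
	return (unsigned long)val;
}

/* Widening via unsigned int: the upper bits are zero-filled. */
unsigned long widen_zero_extended(int val)
{
	return (unsigned long)(unsigned int)val;
}
```

Only values with the most significant bit set differ between the two helpers; non-negative inputs convert identically.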

Fixes: 76a9ebe8 ("net: extend sk_pacing_rate to unsigned long")
Signed-off-by: Ji Li <jli@akamai.com>
Signed-off-by: Ke Li <keli@akamai.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

14 Oct, 2020 · 2 commits

socket: don't clear SOCK_TSTAMP_NEW when SO_TIMESTAMPNS is disabled · 4e3bbb33

Committed by Christian Eggers on Oct 12, 2020

SOCK_TSTAMP_NEW (timespec64 instead of timespec) is also used for
hardware time stamps (configured via SO_TIMESTAMPING_NEW).

User space (ptp4l) first configures hardware time stamping via
SO_TIMESTAMPING_NEW which sets SOCK_TSTAMP_NEW. In the next step, ptp4l
disables SO_TIMESTAMPNS(_NEW) (software time stamps), but this must not
switch hardware time stamps back to "32 bit mode".

This problem happens on 32 bit platforms where the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).

Fixes: 887feae3 ("socket: Add SO_TIMESTAMP[NS]_NEW")
Fixes: 783da70e ("net: add sock_enable_timestamps")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

socket: fix option SO_TIMESTAMPING_NEW · 59e611a5

Committed by Christian Eggers on Oct 12, 2020

The comparison of optname with SO_TIMESTAMPING_NEW is the wrong way around,
so SOCK_TSTAMP_NEW will first be set and then reset again. Additionally
move it out of the test for SOF_TIMESTAMPING_RX_SOFTWARE as this seems
unrelated.

This problem happens on 32 bit platforms where the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).

Fixes: 9718475e ("socket: Add SO_TIMESTAMPING_NEW")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

25 Sep, 2020 · 1 commit

mptcp: add sk_stop_timer_sync helper · 08b81d87

Committed by Geliang Tang on Sep 24, 2020

This patch adds a new helper, sk_stop_timer_sync, which deactivates a timer
like sk_stop_timer but waits for the handler to finish.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Aug, 2020 · 1 commit

net: Fix some comments · 645f0897

Committed by Miaohe Lin on Aug 27, 2020

Fix some comments, including wrong function name, duplicated word and so
on.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Aug, 2020 · 1 commit

treewide: Use fallthrough pseudo-keyword · df561f66

Committed by Gustavo A. R. Silva on Aug 23, 2020

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
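
A stand-alone illustration of the pseudo-keyword follows. The macro definition is a sketch that assumes a GCC- or Clang-compatible compiler and degrades to an empty statement elsewhere; `count_remaining_steps()` is a made-up function used only to show intentional fall-through between cases.

```c
#include <assert.h>

/* Sketch of a fallthrough macro; assumes GCC/Clang attribute support. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)	/* fallthrough */
#endif

/* Each stage deliberately falls through into the stages after it. */
int count_remaining_steps(int stage)
{
	int steps = 0;

	switch (stage) {
	case 0:
		steps++;
		fallthrough;
	case 1:
		steps++;
		fallthrough;
	case 2:
		steps++;
		break;
	default:
		break;
	}
	return steps;
}
```

Unlike a comment, the statement form is visible to the compiler, so an accidental fall-through without the marker can still be diagnosed.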

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

20 Aug, 2020 · 1 commit

net: Stop warning about SO_BSDCOMPAT usage · f4ecc748

Committed by Miaohe Lin on Aug 19, 2020

We've been warning about SO_BSDCOMPAT usage for many years. We may remove
this code completely now.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Aug, 2020 · 1 commit

net: Fix potential memory leak in proto_register() · 0f5907af

Committed by Miaohe Lin on Aug 10, 2020

If we failed to assign proto idx, we free the twsk_slab_name but forget to
free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
together and also use this helper function in proto_unregister().

Fixes: b45ce321 ("sock: fix potential memory leak in proto_register()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Aug, 2020 · 1 commit

mm, treewide: rename kzfree() to kfree_sensitive() · 453431a5

Committed by Waiman Long on Aug 06, 2020

As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.
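
The clear-then-free pattern can be approximated in user space as follows. Both function names are hypothetical, and the sketch differs from the kernel helper in one important way: the caller must pass the buffer length explicitly, since plain `malloc()` offers no portable way to query it. Writing through a `volatile` pointer is one common way to discourage the compiler from dropping the stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Zero a buffer through a volatile pointer so the stores are kept. */
void memzero_explicit_sketch(void *p, size_t len)
{
	volatile unsigned char *b = p;

	while (len--)
		*b++ = 0;
}

/* Clear sensitive contents, then release the memory. NULL is a no-op. */
void free_sensitive_sketch(void *p, size_t len)
{
	if (!p)
		return;
	memzero_explicit_sketch(p, len);
	free(p);
}
```

A typical use is releasing key material, for example `free_sensitive_sketch(key, key_len)`, where an ordinary `free()` would leave the bytes readable by the next user of that memory.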

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/kzfree/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.

[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

06 Aug, 2020 · 1 commit

net: sock: add sock_set_mark · 84d1c617

Committed by Alexander Aring on Jun 26, 2020

This patch adds a new socket helper function to set the mark value for a
kernel socket.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>

25 Jul, 2020 · 5 commits

net: pass a sockptr_t into ->setsockopt · a7b75c5a

Committed by Christoph Hellwig on Jul 23, 2020

Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer.  This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch sock_set_timeout to sockptr_t · c8c1bbb6

Committed by Christoph Hellwig on Jul 23, 2020

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch sock_set_timeout to sockptr_t · c34645ac

Committed by Christoph Hellwig on Jul 23, 2020

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch sock_setbindtodevice to sockptr_t · 5790642b

Committed by Christoph Hellwig on Jul 23, 2020

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch copy_bpf_fprog_from_user to sockptr_t · b1ea9ff6

Committed by Christoph Hellwig on Jul 23, 2020

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Jul, 2020 · 1 commit

net: explicitly include <linux/compat.h> in net/core/sock.c · a6c0d093

Committed by Christoph Hellwig on Jul 22, 2020

The buildbot found a config where the header isn't already implicitly
pulled in, so add an explicit include as well.

Fixes: 8c918ffb ("net: remove compat_sock_common_{get,set}sockopt")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

20 Jul, 2020 · 4 commits

net: make ->{get,set}sockopt in proto_ops optional · a44d9e72

Committed by Christoph Hellwig on Jul 17, 2020

Just check for a NULL method instead of wiring up
sock_no_{get,set}sockopt.
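
The optional-method pattern can be sketched with an ordinary struct of function pointers. The names here (`proto_ops_sketch`, `do_setsockopt_sketch()`, `accept_any_sketch()`) are hypothetical, and returning `-EOPNOTSUPP` for a missing method is an assumption chosen for the illustration: the dispatcher checks for NULL itself, so protocols without the operation no longer need to install a stub.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with an optional setsockopt method. */
struct proto_ops_sketch {
	int (*setsockopt)(int optname, int val);
};

/* Example implementation for a protocol that does support the call. */
int accept_any_sketch(int optname, int val)
{
	(void)optname;
	(void)val;
	return 0;
}

/* Dispatcher: a NULL method means "not supported", no stub required. */
int do_setsockopt_sketch(const struct proto_ops_sketch *ops, int optname, int val)
{
	if (!ops->setsockopt)
		return -EOPNOTSUPP;
	return ops->setsockopt(optname, val);
}
```

Leaving the pointer NULL in an operations table is then enough to opt out of the call.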
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv6: remove compat_ipv6_{get,set}sockopt · 3021ad52

Committed by Christoph Hellwig on Jul 17, 2020

Handle the few cases that need special treatment in-line using
in_compat_syscall().  This also removes all the now unused
compat_{get,set}sockopt methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove compat_sock_common_{get,set}sockopt · 8c918ffb

Committed by Christoph Hellwig on Jul 17, 2020

Add the compat handling to sock_common_{get,set}sockopt instead,
keyed off in_compat_syscall().  This allows removing the now unused
->compat_{get,set}sockopt methods from struct proto_ops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: simplify cBPF setsockopt compat handling · 4d295e54

Committed by Christoph Hellwig on Jul 17, 2020

Add a helper that copies either a native or compat bpf_fprog from
userspace after verifying the length, and remove the compat setsockopt
handlers that now aren't required.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

14 Jul, 2020 · 1 commit

net/compat: Add missing sock updates for SCM_RIGHTS · d9539752

Committed by Kees Cook on Jun 09, 2020

Add missed sock updates to compat path via a new helper, which will be
used more in coming patches. (The net/core/scm.c code is left as-is here
to assist with -stable backports for the compat path.)

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 48a87cc2 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
Fixes: d8429506 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>

10 Jul, 2020 · 1 commit

inet_diag: support for wider protocol numbers · 3f935c75

Committed by Paolo Abeni on Jul 09, 2020

After commit bf976514 ("sock: Make sk_protocol a 16-bit value")
the current size of 'sdiag_protocol' is not sufficient to represent
the possible protocol values.

This change introduces a new inet diag request attribute to let
user space specify the relevant protocol number using u32 values.

The attribute is parsed by inet diag core on get/dump command
and the extended protocol value, if available, is preferred to
'sdiag_protocol' to lookup the diag handler.

The parsed attributes are exposed to all the diag handlers via
the cb->data.

Note that inet_diag_dump_one_icsk() is left unmodified, as it
will not be used by protocols using the extended attribute.
Suggested-by: David S. Miller <davem@davemloft.net>
Co-developed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Jul, 2020 · 1 commit

cgroup: fix cgroup_sk_alloc() for sk_clone_lock() · ad0f75e5

Committed by Cong Wang on Jul 02, 2020

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
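
Storing a flag in the spare low bits of an aligned pointer is a general technique that can be shown in isolation. The sketch below is not the kernel's `sock_cgroup_data` layout: the type, the tag values and the `skcd_*` helper names are all assumptions made for illustration. It relies on the pointed-to object being aligned to at least 4 bytes, which leaves the two lowest address bits free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cgroup object. */
struct cgroup_sketch {
	int id;
};

/* Two low bits are free when the object is at least 4-byte aligned. */
#define TAG_MASK	((uintptr_t)0x3)
#define TAG_NO_REFCNT	((uintptr_t)0x2)	/* illustrative flag position */

/* Pack the pointer and the "no reference taken" flag into one word. */
uintptr_t skcd_pack(struct cgroup_sketch *cgrp, int no_refcnt)
{
	uintptr_t val = (uintptr_t)cgrp;

	assert((val & TAG_MASK) == 0);	/* alignment assumption must hold */
	return no_refcnt ? (val | TAG_NO_REFCNT) : val;
}

/* Recover the pointer by masking the tag bits off. */
struct cgroup_sketch *skcd_cgroup(uintptr_t val)
{
	return (struct cgroup_sketch *)(val & ~TAG_MASK);
}

/* Read back the flag recorded at pack time. */
int skcd_no_refcnt(uintptr_t val)
{
	return (val & TAG_NO_REFCNT) != 0;
}
```

Because the flag travels with the pointer, the release path can decide whether to drop a reference from the stored word alone, without consulting any global state that may have changed in the meantime.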

This bug seems to have been present since the beginning; commit
d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not completely. It seems not easy to trigger until
the recent commit 090e28b2
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Jun, 2020 · 1 commit

docs: RCU: Convert rculist_nulls.txt to ReST · 2cdb54c9

Committed by Mauro Carvalho Chehab on Apr 21, 2020

- Add a SPDX header;
- Adjust document title;
- Some whitespace fixes and new line breaks;
- Mark literal blocks as such;
- Add it to RCU/index.rst.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

25 Jun, 2020 · 1 commit

sock: Move sock_valbool_flag to header · dfde1d7d

Committed by Dmitry Yakunin on Jun 20, 2020

This is preparation for usage in bpf_setsockopt.
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200620153052.9439-1-zeil@yandex-team.ru

24 Jun, 2020 · 1 commit

net: Do not clear the sock TX queue in sk_set_socket() · 41b14fb8

Committed by Tariq Toukan on Jun 22, 2020

Clearing the sock TX queue in sk_set_socket() might cause unexpected
out-of-order transmit when called from sock_orphan(), as outstanding
packets can pick a different TX queue and bypass the ones already queued.

This is undesired in general. More specifically, it breaks the in-order
scheduling property guarantee for device-offloaded TLS sockets.

Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it
explicitly only where needed.

Fixes: e022f0b4 ("net: Introduce sk_tx_queue_mapping")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Jun, 2020 · 1 commit

net: increment xmit_recursion level in dev_direct_xmit() · 0ad6f6e7

Committed by Eric Dumazet on Jun 17, 2020

Back in commit f60e5990 ("ipv6: protect skb->sk accesses
from recursive dereference inside the stack") Hannes added code
so that IPv6 stack would not trust skb->sk for typical cases
where packet goes through 'standard' xmit path (__dev_queue_xmit())

Alas af_packet had a dev_direct_xmit() path that was not
dealing yet with xmit_recursion level.

Also change sk_mc_loop() to dump a stack once only.

Without this patch, syzbot was able to trigger :

[1]
[  153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70
[  153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core
[  153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273
[  153.567387] RIP: 0010:sk_mc_loop+0x51/0x70
[  153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 <0f> 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0
[  153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212
[  153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007
[  153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000
[  153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008
[  153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000
[  153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00
[  153.567390] FS:  00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000
[  153.567390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0
[  153.567391] Call Trace:
[  153.567391]  ip6_finish_output2+0x34e/0x550
[  153.567391]  __ip6_finish_output+0xe7/0x110
[  153.567391]  ip6_finish_output+0x2d/0xb0
[  153.567392]  ip6_output+0x77/0x120
[  153.567392]  ? __ip6_finish_output+0x110/0x110
[  153.567392]  ip6_local_out+0x3d/0x50
[  153.567392]  ipvlan_queue_xmit+0x56c/0x5e0
[  153.567393]  ? ksize+0x19/0x30
[  153.567393]  ipvlan_start_xmit+0x18/0x50
[  153.567393]  dev_direct_xmit+0xf3/0x1c0
[  153.567393]  packet_direct_xmit+0x69/0xa0
[  153.567394]  packet_sendmsg+0xbf0/0x19b0
[  153.567394]  ? plist_del+0x62/0xb0
[  153.567394]  sock_sendmsg+0x65/0x70
[  153.567394]  sock_write_iter+0x93/0xf0
[  153.567394]  new_sync_write+0x18e/0x1a0
[  153.567395]  __vfs_write+0x29/0x40
[  153.567395]  vfs_write+0xb9/0x1b0
[  153.567395]  ksys_write+0xb1/0xe0
[  153.567395]  __x64_sys_write+0x1a/0x20
[  153.567395]  do_syscall_64+0x43/0x70
[  153.567396]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  153.567396] RIP: 0033:0x453549
[  153.567396] Code: Bad RIP value.
[  153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549
[  153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003
[  153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000
[  153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc
[  153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700
[  153.567399] ---[ end trace c1d5ae2b1059ec62 ]---

Fixes: f60e5990 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Jun, 2020 · 1 commit

net: Make locking in sock_bindtoindex optional · 8ea204c2

Committed by Ferenc Fejes on May 30, 2020

sock_bindtoindex is intended for kernel-wide use, but it locks the
socket regardless of the context. This change makes the locking
optional: the socket is locked only when sock_bindtoindex is called
with lock_sk = true.

The change is applied to all users of sock_bindtoindex.
Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu

30 May, 2020 · 1 commit

net: add a new bind_add method · c0425a42

Committed by Christoph Hellwig on May 29, 2020

The SCTP protocol allows binding multiple addresses to a socket.  That
feature is currently only exposed as a socket option.  Add a bind_add
method to struct proto that allows binding additional addresses, and
switch the dlm code to use the method instead of going through the
socket option from kernel space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29 May, 2020 · 1 commit

net: add sock_set_reuseport · fe31a326

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the SO_REUSEPORT sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
