1. 18 July 2023 (1 commit)
  2. 22 June 2023 (1 commit)
    • revert "net: align SO_RCVMARK required privileges with SO_MARK" · a9628e88
      Committed by Maciej Żenczykowski
      This reverts commit 1f86123b ("net: align SO_RCVMARK required
      privileges with SO_MARK") because the reasoning in the commit message
      is not really correct:
        SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such
        it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check
        and retrieves the socket mark, rather than 'setsockopt(SO_MARK) which
        sets the socket mark and does require privs.
      
        Additionally incoming skb->mark may already be visible if
        sysctl_fwmark_reflect and/or sysctl_tcp_fwmark_accept are enabled.
      
        Furthermore, it is easier to block the getsockopt via bpf
        (either cgroup setsockopt hook, or via syscall filters)
        than to unblock it if it requires CAP_NET_RAW/ADMIN.
      
      On Android the socket mark is (among other things) used to store
      the network identifier a socket is bound to.  Setting it is privileged,
      but retrieving it is not.  We'd like unprivileged userspace to be able
      to read the network id of incoming packets (where mark is set via
      iptables [to be moved to bpf])...
      
      An alternative would be to add another sysctl to control whether
      setting SO_RCVMARK is privileged or not.
      (or even a MASK of which bits in the mark can be exposed)
      But this seems like over-engineering...
      
      Note: This is a non-trivial revert, due to later merged commit e42c7bee
      ("bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()")
      which changed both 'ns_capable' into 'sockopt_ns_capable' calls.
      
      Fixes: 1f86123b ("net: align SO_RCVMARK required privileges with SO_MARK")
      Cc: Larysa Zaremba <larysa.zaremba@intel.com>
      Cc: Simon Horman <simon.horman@corigine.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eyal Birger <eyal.birger@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Patrick Rohr <prohr@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230618103130.51628-1-maze@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
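      The behaviour restored by this revert can be exercised from plain,
      unprivileged userspace. A minimal sketch (not part of the commit):
      enable SO_RCVMARK on a UDP socket bound to an arbitrary port (5555
      here) and read the incoming skb mark, which the kernel delivers as an
      SO_MARK control message. The fallback defines carry the asm-generic
      uapi values and are only needed with older libc headers.

        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_MARK
        #define SO_MARK 36          /* asm-generic uapi value */
        #endif
        #ifndef SO_RCVMARK
        #define SO_RCVMARK 75       /* asm-generic uapi value */
        #endif

        int main(void)
        {
                int one = 1;
                int fd = socket(AF_INET, SOCK_DGRAM, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htons(5555),
                        .sin_addr.s_addr = htonl(INADDR_ANY),
                };
                union {                          /* aligned cmsg buffer */
                        char buf[CMSG_SPACE(sizeof(uint32_t))];
                        struct cmsghdr align;
                } control;
                char buf[2048];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
                struct msghdr msg = {
                        .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = control.buf,
                        .msg_controllen = sizeof(control.buf),
                };
                struct cmsghdr *cmsg;
                uint32_t mark;

                if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        return 1;
                /* With the revert, no CAP_NET_RAW/CAP_NET_ADMIN is needed here. */
                if (setsockopt(fd, SOL_SOCKET, SO_RCVMARK, &one, sizeof(one)) < 0)
                        return 1;
                if (recvmsg(fd, &msg, 0) < 0)
                        return 1;
                /* The mark arrives as a SOL_SOCKET/SO_MARK cmsg. */
                for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                        if (cmsg->cmsg_level == SOL_SOCKET &&
                            cmsg->cmsg_type == SO_MARK) {
                                memcpy(&mark, CMSG_DATA(cmsg), sizeof(mark));
                                printf("incoming skb mark: %u\n", mark);
                        }
                }
                return 0;
        }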
  3. 31 May 2023 (1 commit)
    • udp6: Fix race condition in udp6_sendmsg & connect · 448a5ce1
      Committed by Vladislav Efanov
      Syzkaller got the following report:
      BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
      Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255
      
      The function sk_setup_caps() (called via ip6_sk_dst_store_flow->
      ip6_dst_store) referenced memory that had already been freed by a
      parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
      sk_dst_check.
      
                task1 (connect)              task2 (udp6_sendmsg)
              sk_setup_caps->sk_dst_set |
                                        |  sk_dst_check->
                                        |      sk_dst_set
                                        |      dst_release
              sk_setup_caps references  |
              to already freed dst_entry|
      
      The reason for this race condition is: sk_setup_caps() keeps using
      the dst after transferring the ownership to the dst cache.
      
      Found by Linux Verification Center (linuxtesting.org) with syzkaller.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
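      The fix implied above is an ordering change: finish every use of the
      dst before handing ownership to the socket's dst cache. Below is a
      toy, single-threaded C model of the two orderings; the names (dst,
      sk_dst_set, sk_setup_caps) only mirror the kernel, this is not kernel
      code.

        #include <stdio.h>
        #include <stdlib.h>

        struct dst  { int refcnt; unsigned int features; };
        struct sock { struct dst *sk_dst_cache; unsigned int sk_route_caps; };

        static void dst_release(struct dst *d)
        {
                if (--d->refcnt == 0)
                        free(d);
        }

        /* Publish dst into the socket's dst cache, dropping the previous
         * entry. After this point another CPU (sk_dst_check() in
         * udpv6_sendmsg) may swap and release the entry at any time. */
        static void sk_dst_set(struct sock *sk, struct dst *d)
        {
                struct dst *old = sk->sk_dst_cache;

                sk->sk_dst_cache = d;
                if (old)
                        dst_release(old);
        }

        /* Pre-fix ordering: ownership is transferred first, then *d is
         * still read. With a concurrent dst_release() that read is the
         * use-after-free KASAN reported. */
        static void sk_setup_caps_buggy(struct sock *sk, struct dst *d)
        {
                sk_dst_set(sk, d);
                sk->sk_route_caps = d->features;   /* racy read after handover */
        }

        /* Fixed ordering: finish every use of *d, transfer ownership last. */
        static void sk_setup_caps_fixed(struct sock *sk, struct dst *d)
        {
                sk->sk_route_caps = d->features;
                sk_dst_set(sk, d);
        }

        int main(void)
        {
                struct sock sk = { 0 };
                struct dst *d = calloc(1, sizeof(*d));

                d->refcnt = 1;
                d->features = 0x1;
                sk_setup_caps_fixed(&sk, d);
                (void)sk_setup_caps_buggy;         /* kept for comparison only */
                printf("caps: %x\n", sk.sk_route_caps);
                dst_release(sk.sk_dst_cache);
                return 0;
        }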
  4. 08 April 2023 (1 commit)
  5. 02 March 2023 (1 commit)
  6. 15 February 2023 (1 commit)
  7. 08 February 2023 (1 commit)
    • txhash: fix sk->sk_txrehash default · c11204c7
      Committed by Kevin Yang
      This fixes a bug where sk->sk_txrehash got its default enable
      value from sysctl_txrehash only when the socket was a TCP listener.

      sysctl_txrehash should set the default sk->sk_txrehash for every
      socket, no matter whether it is TCP or UDP, listener or connector.
      
      Tested by following packetdrill:
        0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
        +0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
        // SO_TXREHASH == 74, default to sysctl_txrehash == 1
        +0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
        +0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0
      
      Fixes: 26859240 ("txhash: Add socket option to control TX hash rethink behavior")
      Signed-off-by: Kevin Yang <yyd@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
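      The packetdrill script can be mirrored with a plain C check. A small
      sketch (not from the commit): query SO_TXREHASH on a fresh TCP and a
      fresh UDP socket; with the net.core.txrehash sysctl left at its
      default of 1, both should now report 1. The fallback define uses the
      value 74 quoted in the commit message.

        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_TXREHASH
        #define SO_TXREHASH 74      /* value quoted above */
        #endif

        int main(void)
        {
                int fds[2] = { socket(AF_INET, SOCK_STREAM, 0),
                               socket(AF_INET, SOCK_DGRAM, 0) };

                for (int i = 0; i < 2; i++) {
                        int val = -1;
                        socklen_t len = sizeof(val);

                        if (getsockopt(fds[i], SOL_SOCKET, SO_TXREHASH,
                                       &val, &len) == 0)
                                printf("fd %d: SO_TXREHASH = %d\n", fds[i], val);
                }
                return 0;
        }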
  8. 06 February 2023 (1 commit)
  9. 02 February 2023 (1 commit)
    • net: add support for ipv4 big tcp · b1a78b9b
      Committed by Xin Long
      Similar to Eric's IPv6 BIG TCP, this patch enables IPv4 BIG TCP.
      
      Firstly, allow sk->sk_gso_max_size to be set to a value greater than
      GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
      for IPv4 TCP sockets.
      
      Then, on the TX path, set the IP header tot_len to 0 when skb->len >
      IP_MAX_MTU in __ip_local_out() to allow sending BIG TCP packets; this
      implies that skb->len is the length of an IPv4 packet. On the RX path,
      use skb->len as the length of the IPv4 packet when the IP header
      tot_len is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs
      iph_set_totlen() and skb_ip_totlen() are used in __ip_local_out() and
      ip_rcv_core(), we only need to update these APIs.
      
      Also, in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP and allow
      the merged packet size to be >= GRO_LEGACY_MAX_SIZE in
      skb_gro_receive(). In GRO complete, set the IP header tot_len to 0 in
      iph_set_totlen() when the merged packet size is greater than
      IP_MAX_MTU, so that it can be processed on the RX path.
      
      Note that checking skb_is_gso_tcp() in the iph_totlen() API makes it
      safe for this implementation to use iph->len == 0 to indicate IPv4
      BIG TCP packets.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
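      The tot_len convention described above boils down to two small
      helpers. Below is a standalone sketch that mirrors the described
      behaviour with simplified signatures; it is not the kernel's
      iph_set_totlen()/skb_ip_totlen() source.

        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>

        #define IP_MAX_MTU 0xFFFFU

        struct iphdr_totlen { uint16_t tot_len; };  /* only the field we need */

        /* TX / GRO-complete side: lengths above IP_MAX_MTU cannot be
         * represented in the 16-bit field, so store 0 and let the receiver
         * fall back to skb->len. */
        static void iph_set_totlen(struct iphdr_totlen *iph, unsigned int len)
        {
                iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
        }

        /* RX side: a zero tot_len on a GSO TCP skb means "use skb->len". */
        static unsigned int skb_ip_totlen(unsigned int skb_len, int is_gso_tcp,
                                          const struct iphdr_totlen *iph)
        {
                if (iph->tot_len == 0 && is_gso_tcp)
                        return skb_len;
                return ntohs(iph->tot_len);
        }

        int main(void)
        {
                struct iphdr_totlen iph;

                iph_set_totlen(&iph, 200000);           /* BIG TCP sized */
                printf("tot_len on wire: %u\n", ntohs(iph.tot_len));   /* 0 */
                printf("effective len:   %u\n", skb_ip_totlen(200000, 1, &iph));
                return 0;
        }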
  10. 23 January 2023 (1 commit)
  11. 20 December 2022 (1 commit)
    • net: Introduce sk_use_task_frag in struct sock. · fb87bd47
      Committed by Guillaume Nault
      Sockets that can be used while recursing into memory reclaim, like
      those used by network block devices and file systems, mustn't use
      current->task_frag: if the current process is already using it, then
      the inner memory reclaim call would corrupt the task_frag structure.
      
      To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
      that mustn't use current->task_frag, assuming that those used during
      memory reclaim had their allocation constraints reflected in
      ->sk_allocation.
      
      This unfortunately doesn't cover all cases: in an attempt to remove all
      usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
      ->sk_allocation, and used memalloc_nofs critical sections instead.
      This breaks the sk_page_frag() heuristic since the allocation
      constraints are now stored in current->flags, which sk_page_frag()
      can't read without risking triggering a cache miss and slowing down
      TCP's fast path.
      
      This patch creates a new field in struct sock, named sk_use_task_frag,
      which sockets with memory reclaim constraints can set to false if they
      can't safely use current->task_frag. In such cases, sk_page_frag() now
      always returns the socket's page_frag (->sk_frag). The first user is
      sunrpc, which needs to avoid using current->task_frag but can keep
      ->sk_allocation set to GFP_KERNEL otherwise.
      
      Eventually, it might be possible to simplify sk_page_frag() by only
      testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
      heuristic entirely (assuming other sockets will set ->sk_use_task_frag
      according to their constraints in the future).
      
      The new ->sk_use_task_frag field is placed in a hole in struct sock and
      belongs to a cache line shared with ->sk_shutdown. Therefore it should
      be hot and shouldn't have negative performance impacts on TCP's fast
      path (sk_shutdown is tested just before the while() loop in
      tcp_sendmsg_locked()).
      
      Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
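      A simplified userspace model of the decision sk_page_frag() makes
      after this patch; it models only the new field (the real helper also
      keeps the existing ->sk_allocation heuristic) and is not kernel code.

        #include <stdbool.h>
        #include <stdio.h>

        struct page_frag   { void *page; unsigned int offset, size; };
        struct task_struct { struct page_frag task_frag; };
        struct sock        { bool sk_use_task_frag; struct page_frag sk_frag; };

        static struct task_struct current_task;   /* stand-in for 'current' */

        static struct page_frag *sk_page_frag(struct sock *sk)
        {
                /* Sockets that may be used from memory reclaim (e.g. sunrpc)
                 * set sk_use_task_frag = false and always get their own
                 * ->sk_frag, so they can never corrupt current->task_frag
                 * while recursing into reclaim. */
                if (sk->sk_use_task_frag)
                        return &current_task.task_frag;
                return &sk->sk_frag;
        }

        int main(void)
        {
                struct sock tcp_sk = { .sk_use_task_frag = true  };
                struct sock nbd_sk = { .sk_use_task_frag = false };

                printf("tcp shares task_frag: %d\n",
                       sk_page_frag(&tcp_sk) == &current_task.task_frag);
                printf("nbd shares task_frag: %d\n",
                       sk_page_frag(&nbd_sk) == &current_task.task_frag);
                return 0;
        }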
  12. 09 December 2022 (1 commit)
    • net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP · b534dc46
      Committed by Willem de Bruijn
      Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
      write_seq sockets instead of snd_una.
      
      This should have been the behavior from the start. Because processes
      may now exist that rely on the established behavior, do not change
      behavior of the existing option, but add the right behavior with a new
      flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
      stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.
      
      Intuitively the contract is that the counter is zero after the
      setsockopt, so that the next write N results in a notification for
      the last byte N - 1.
      
      On idle sockets snd_una == write_seq and this holds for both. But on
      sockets with data in transmission, snd_una records the unacked offset
      in the stream. This depends on the ACK response from the peer. A
      process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
      racy approach).
      
      write_seq records the offset at the last byte written by the process.
      This is a better starting point. It matches the intuitive contract in
      all circumstances, unaffected by external behavior.
      
      The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
      Move the field in struct sock to avoid growing the socket (for some
      common CONFIG variants). The UAPI interface so_timestamping.flags is
      already int, so 32 bits wide.
      Reported-by: Sotirios Delimanolis <sotodel@meta.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
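      A minimal usage sketch (not from the commit): request software TX ACK
      timestamps on a TCP socket with both OPT_ID and the new OPT_ID_TCP
      flag, so the returned IDs count from write_seq as described above.
      The SOF_TIMESTAMPING_OPT_ID_TCP fallback define carries the value
      from this patch (1 << 16) for older <linux/net_tstamp.h> headers.

        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <linux/net_tstamp.h>

        #ifndef SOF_TIMESTAMPING_OPT_ID_TCP
        #define SOF_TIMESTAMPING_OPT_ID_TCP (1 << 16)
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                unsigned int flags = SOF_TIMESTAMPING_TX_ACK |
                                     SOF_TIMESTAMPING_SOFTWARE |
                                     SOF_TIMESTAMPING_OPT_ID |
                                     SOF_TIMESTAMPING_OPT_ID_TCP;

                /* The counter starts at zero relative to write_seq, so the
                 * next write of N bytes reports an ID for byte N - 1. */
                if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                               &flags, sizeof(flags)) < 0)
                        perror("setsockopt(SO_TIMESTAMPING)");
                return 0;
        }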
  13. 05 November 2022 (1 commit)
    • lsm: make security_socket_getpeersec_stream() sockptr_t safe · b10b9c34
      Committed by Paul Moore
      Commit 4ff09db1 ("bpf: net: Change sk_getsockopt() to take the
      sockptr_t argument") made it possible to call sk_getsockopt()
      with both user and kernel address space buffers through the use of
      the sockptr_t type.  Unfortunately at the time of conversion the
      security_socket_getpeersec_stream() LSM hook was written to only
      accept userspace buffers, and in a desire to avoid having to change
      the LSM hook the commit author simply passed the sockptr_t's
      userspace buffer pointer.  Since the only sk_getsockopt() callers
      at the time of conversion which used kernel sockptr_t buffers did
      not allow SO_PEERSEC, and hence the
      security_socket_getpeersec_stream() hook, this was acceptable but
      also very fragile as future changes presented the possibility of
      silently passing kernel space pointers to the LSM hook.
      
      There are several ways to protect against this, including careful
      code review of future commits, but since relying on code review to
      catch bugs is a recipe for disaster and the upstream eBPF maintainer
      is "strongly against defensive programming", this patch updates the
      LSM hook, and all of the implementations to support sockptr_t and
      safely handle both user and kernel space buffers.
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
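      For readers unfamiliar with sockptr_t, below is a userspace
      illustration of the idea only; the real definition and copy helpers
      live in include/linux/sockptr.h and use copy_to_user() for the user
      branch, so this is not kernel code.

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* One handle that remembers whether the buffer is a kernel or a
         * user pointer, so a hook can copy to either safely instead of
         * assuming __user. */
        typedef struct {
                void *ptr;
                bool  is_kernel;
        } sockptr_t;

        static int copy_to_sockptr_model(sockptr_t dst, const void *src,
                                         size_t size)
        {
                if (dst.is_kernel) {
                        memcpy(dst.ptr, src, size);  /* direct kernel copy */
                        return 0;
                }
                /* The kernel would call copy_to_user() here; modeled as a
                 * plain copy in this userspace sketch. */
                memcpy(dst.ptr, src, size);
                return 0;
        }

        int main(void)
        {
                char out[32];
                sockptr_t dst = { .ptr = out, .is_kernel = true };

                copy_to_sockptr_model(dst, "label", sizeof("label"));
                printf("%s\n", out);
                return 0;
        }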
  14. 25 October 2022 (1 commit)
    • soreuseport: Fix socket selection for SO_INCOMING_CPU. · b261eda8
      Committed by Kuniyuki Iwashima
      Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
      with setsockopt(SO_REUSEPORT) since v4.6.
      
      With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
      build a highly efficient server application.
      
      setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
      or UDP socket, and then incoming packets processed on the CPU will
      likely be distributed to the socket.  Technically, a socket could
      even receive packets handled on another CPU if no sockets in the
      reuseport group have the same CPU receiving the flow.
      
      The logic exists in compute_score() so that a socket will get a higher
      score if it has the same CPU as the flow.  However, the score gets
      ignored after the two blamed commits, which introduced a faster socket
      selection algorithm for SO_REUSEPORT.
      
      This patch introduces a counter of sockets with SO_INCOMING_CPU in
      a reuseport group to check if we should iterate all sockets to find
      a proper one.  We increment the counter when
      
        * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
      
        * enabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      Also, we decrement it when
      
        * detaching a socket out of the group to apply SO_INCOMING_CPU to
          migrated TCP requests
      
        * disabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      When the counter reaches 0, we can get back to the O(1) selection
      algorithm.
      
      The overall changes are negligible for the non-SO_INCOMING_CPU case,
      and the only notable thing is that we have to update sk_incoming_cpu
      under reuseport_lock.  Otherwise, the race prevents transitioning to
      the O(n) algorithm and results in the wrong socket selection.
      
       cpu1 (setsockopt)               cpu2 (listen)
      +-----------------+             +-------------+
      
      lock_sock(sk1)                  lock_sock(sk2)
      
      reuseport_update_incoming_cpu(sk1, val)
      .
      |  /* set CPU as 0 */
      |- WRITE_ONCE(sk1->incoming_cpu, val)
      |
      |                               spin_lock_bh(&reuseport_lock)
      |                               reuseport_grow(sk2, reuse)
      |                               .
      |                               |- more_socks_size = reuse->max_socks * 2U;
      |                               |- if (more_socks_size > U16_MAX &&
      |                               |       reuse->num_closed_socks)
      |                               |  .
      |                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
      |                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
      |                               |     .
      |                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
      |                               |        .
      |                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
      |                               |        |   * without lock_sock().
      |                               |        |   */
      |                               |        `- if (sk1->sk_incoming_cpu >= 0)
      |                               |           .
      |                               |           |  /* decrement not-yet-incremented
      |                               |           |   * count, which is never incremented.
      |                               |           |   */
      |                               |           `- __reuseport_put_incoming_cpu(reuse);
      |                               |
      |                               `- spin_lock_bh(&reuseport_lock)
      |
      |- spin_lock_bh(&reuseport_lock)
      |
      |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
      |- if (!reuse)
      |  .
      |  |  /* Cannot increment reuse->incoming_cpu. */
      |  `- goto out;
      |
      `- spin_unlock_bh(&reuseport_lock)
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: c125e80b ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: Kazuho Oku <kazuhooku@gmail.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
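      A userspace sketch of the pattern this fix targets (not from the
      commit): one TCP listener per CPU, all in the same reuseport group,
      each pinned with SO_INCOMING_CPU before listen(). The port (8080) and
      the 4-CPU loop are arbitrary; the SO_INCOMING_CPU fallback define (49)
      is the asm-generic uapi value.

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_INCOMING_CPU
        #define SO_INCOMING_CPU 49      /* asm-generic uapi value */
        #endif

        static int make_listener(int cpu, int port)
        {
                int one = 1;
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htons(port),
                        .sin_addr.s_addr = htonl(INADDR_ANY),
                };

                setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
                /* Associate this listener with one CPU of the group; packets
                 * processed on that CPU should be steered to this socket. */
                setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
                bind(fd, (struct sockaddr *)&addr, sizeof(addr));
                listen(fd, 128);
                return fd;
        }

        int main(void)
        {
                for (int cpu = 0; cpu < 4; cpu++)
                        make_listener(cpu, 8080);
                pause();                        /* serve until interrupted */
                return 0;
        }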
  15. 24 October 2022 (2 commits)
  16. 13 October 2022 (1 commit)
  17. 03 September 2022 (3 commits)
  18. 24 August 2022 (3 commits)
  19. 19 August 2022 (4 commits)
  20. 06 July 2022 (1 commit)
  21. 13 June 2022 (1 commit)
  22. 11 June 2022 (3 commits)
  23. 10 June 2022 (1 commit)
  24. 16 May 2022 (2 commits)
    • net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if · e5fccaa1
      Committed by Eric Dumazet
      sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
      because other cpus/threads might locklessly read this field.
      
      sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
      because they run without socket lock held.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
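      A minimal userspace illustration of the annotation pattern; the
      kernel's READ_ONCE/WRITE_ONCE in <linux/compiler.h> are more
      elaborate, and the volatile-cast versions below only model the intent
      of a single, tear-free access.

        #include <stdio.h>

        #define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
        #define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

        struct sock { int sk_bound_dev_if; };

        /* Writer side: sock_bindtoindex_locked() runs under the socket
         * lock, but lockless readers exist, so the store is annotated. */
        static void bind_to_index(struct sock *sk, int ifindex)
        {
                WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
        }

        /* Reader side: sock_getsockopt()/sock_getbindtodevice() run without
         * the socket lock, so the load is annotated as well. */
        static int bound_dev_if(struct sock *sk)
        {
                return READ_ONCE(sk->sk_bound_dev_if);
        }

        int main(void)
        {
                struct sock sk = { 0 };

                bind_to_index(&sk, 2);
                printf("sk_bound_dev_if = %d\n", bound_dev_if(&sk));
                return 0;
        }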
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Committed by Alexander Duyck
      The code for gso_max_size was added originally to allow for debugging
      and working around buggy devices that couldn't support TSO with blocks
      64K in size. The original reason for limiting it to 64K was that this
      is the existing limit of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
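      A simplified model of the capping rule described above; it is not the
      kernel's sk_trim_gso_size() helper, just the rule that at this point
      in the series only IPv6 TCP flows may keep a gso_max_size above the
      legacy 64K limit.

        #include <stdio.h>

        #define GSO_LEGACY_MAX_SIZE 65536u

        static unsigned int trim_gso_size(unsigned int gso_max_size,
                                          int is_ipv6_tcp)
        {
                if (is_ipv6_tcp)
                        return gso_max_size;        /* BIG TCP may exceed 64K */
                return gso_max_size < GSO_LEGACY_MAX_SIZE ? gso_max_size
                                                          : GSO_LEGACY_MAX_SIZE;
        }

        int main(void)
        {
                printf("tcpv6: %u\n", trim_gso_size(262144, 1)); /* 262144 */
                printf("other: %u\n", trim_gso_size(262144, 0)); /* 65536  */
                return 0;
        }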
  25. 06 May 2022 (1 commit)
  26. 01 May 2022 (3 commits)
  27. 30 April 2022 (1 commit)