1. 20 July 2022 · 1 commit
  2. 16 July 2022 · 1 commit
    • tcp/udp: Make early_demux back namespacified. · 11052589
      Committed by Kuniyuki Iwashima
      Commit e21145a9 ("ipv4: namespacify ip_early_demux sysctl knob") made
      it possible to enable/disable early_demux on a per-netns basis.  Then, we
      introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
      TCP/UDP in commit dddb64bc ("net: Add sysctl to toggle early demux for
      tcp and udp").  However, the .proc_handler() was wrong and actually
      disabled us from changing the behaviour in each netns.
      
      We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
      .early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
      the change itself is saved in each netns variable, but the .early_demux()
      handler is a global variable, so the handler is switched based on the
      init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
      nothing to do with the logic.  Whether we CAN execute proto .early_demux()
      is always decided by init_net's sysctl knob, and whether we DO it or not is
      by each netns ip_early_demux knob.
      
      This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
      of the .early_demux() handler are TCP and UDP only, and they are called
      directly to avoid retpoline.  So, we can remove the .early_demux() handler
      from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
      If another proto needs .early_demux(), we can restore it at that time.
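
      A rough sketch of how the per-netns dispatch looks after this change
      (a simplified illustration of ip_rcv_finish_core(); return values and
      error handling are omitted):

      if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) &&
          !skb_dst(skb) && !skb->sk && !ip_is_fragment(iph)) {
              switch (iph->protocol) {
              case IPPROTO_TCP:
                      /* per-netns knob checked here, handler called directly */
                      if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux))
                              tcp_v4_early_demux(skb);
                      break;
              case IPPROTO_UDP:
                      if (READ_ONCE(net->ipv4.sysctl_udp_early_demux))
                              udp_v4_early_demux(skb);
                      break;
              }
      }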
      
      Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      11052589
  3. 12 April 2022 · 1 commit
    • net: remove noblock parameter from recvmsg() entities · ec095263
      Committed by Oliver Hartkopp
      The internal recvmsg() functions have two parameters 'flags' and 'noblock'
      that were merged inside skb_recv_datagram(). As a follow-up to commit
      f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()"),
      this patch removes the separate 'noblock' parameter for recvmsg().
      
      Analogous to the referenced patch for skb_recv_datagram(), the 'flags' and
      'noblock' parameters were unnecessarily split up, e.g.
      
      err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                                 flags & ~MSG_DONTWAIT, &addr_len);
      
      or in
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags & MSG_DONTWAIT,
                            flags & ~MSG_DONTWAIT, &addr_len);
      
      instead of simply using only 'flags' all the time and checking for
      MSG_DONTWAIT where needed (to preserve the formerly separate no(n)block
      condition).
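
      After the change, such call sites reduce to passing 'flags' through
      unchanged, with the callee deriving the non-blocking behaviour itself;
      a minimal sketch of the resulting pattern:

      err = sk->sk_prot->recvmsg(sk, msg, size, flags, &addr_len);

      /* and inside the protocol recvmsg(), roughly: */
      timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);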
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ec095263
  4. 16 November 2021 · 1 commit
  5. 26 October 2021 · 1 commit
    • net: multicast: calculate csum of looped-back and forwarded packets · 9122a70a
      Committed by Cyril Strejc
      While testing a user-space application that transmits UDP multicast
      datagrams and uses multicast routing to send the UDP datagrams out of
      defined network interfaces, I found that a multicast router does not
      fill in the UDP checksum of locally produced, looped-back and forwarded
      UDP datagrams if the original output NIC the datagrams are sent to has
      UDP TX checksum offload enabled.

      The datagrams are then sent out malformed on the NIC they have been
      forwarded to.
      
      It is because:
      
       1. If TX checksum offload is enabled on the output NIC, the UDP checksum
          is not calculated by the kernel and is not filled into the skb data.
      
      2. dev_loopback_xmit(), which is called solely by
         ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
         unconditionally.
      
      3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
         CHECKSUM_COMPLETE"), the ip_summed value is preserved during
         forwarding.
      
      4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
         a packet egress.
      
      The minimum fix in dev_loopback_xmit():
      
      1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
         case when the original output NIC has TX checksum offload enabled.
         The effects are:
      
           a) If the forwarding destination interface supports TX checksum
              offloading, the NIC driver is responsible to fill-in the
              checksum.
      
           b) If the forwarding destination interface does NOT support TX
              checksum offloading, checksums are filled-in by kernel before
              skb is submitted to the NIC driver.
      
           c) For local delivery, checksum validation is skipped as in the
              case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
      
       2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. This
          means that for CHECKSUM_NONE the behaviour is unmodified; it is
          there to skip checksum validation on local delivery of a looped-back
          packet (see the sketch below).
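
      A sketch of what the minimum fix amounts to in dev_loopback_xmit()
      (a hedged reconstruction, not a verbatim diff):

      /* was: skb->ip_summed = CHECKSUM_UNNECESSARY;  (unconditional) */
      if (skb->ip_summed == CHECKSUM_NONE)
              skb->ip_summed = CHECKSUM_UNNECESSARY;
      /* CHECKSUM_PARTIAL is preserved, so the checksum is still filled in
       * later, either by the egress NIC or by the kernel.
       */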
      Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9122a70a
  6. 12 April 2021 · 1 commit
  7. 02 April 2021 · 2 commits
  8. 31 March 2021 · 1 commit
    • udp: fixup csum for GSO receive slow path · 000ac44d
      Committed by Paolo Abeni
      When UDP packets generated locally by a socket with UDP_SEGMENT
      traverse the following path:
      
      UDP tunnel(xmit) -> veth (segmentation) -> veth (gro) ->
      	UDP tunnel (rx) -> UDP socket (no UDP_GRO)
      
      ip_summed will be set to CHECKSUM_PARTIAL at creation time and
      such checksum mode will be preserved in the above path up to the
      UDP tunnel receive code where we have:
      
       __iptunnel_pull_header() -> skb_pull_rcsum() ->
      skb_postpull_rcsum() -> __skb_postpull_rcsum()
      
      The latter will convert the skb to CHECKSUM_NONE.
      
      The UDP GSO packet will be later segmented as part of the rx socket
      receive operation, and will present a CHECKSUM_NONE after segmentation.
      
      Additionally, the segmented packets' UDP CB still refers to the original
      GSO packet length. Overall that causes unexpected/wrong csum validation
      errors later in the UDP receive path.
      
      We could possibly address the issue with some additional checks and
      csum mangling in the UDP tunnel code. Since the issue affects only
      this UDP receive slow path, let's set a suitable csum status there.
      
      Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking a UDP
      encapsulation present a valid checksum when landing in
      udp_queue_rcv_skb(), as the UDP checksum has been validated by the GRO
      engine.
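
      A hedged sketch of what "setting a suitable csum status" in the receive
      slow path can look like; the helper name and exact fields are
      illustrative, not a verbatim copy of the patch:

      /* After segmentation in the rx slow path, each segment carries a
       * checksum that the GRO engine has already verified: mark it as
       * validated and reset the coverage to the segment length.
       */
      static inline void udp_post_segment_fix_csum(struct sk_buff *skb)
      {
              UDP_SKB_CB(skb)->cscov = skb->len;
              if (skb->ip_summed == CHECKSUM_NONE && !skb->csum_valid)
                      skb->csum_valid = 1;
      }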
      
      v2 -> v3:
       - even more verbose commit message and comments
      
      v1 -> v2:
       - restrict the csum update to the packets strictly needing them
       - hopefully clarify the commit message and code comments
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      000ac44d
  9. 27 February 2021 · 1 commit
  10. 05 February 2021 · 2 commits
    • ipv6: move udp declarations to net/udp.h · f9a4719c
      Committed by Leon Romanovsky
      Fix the following compilation warning:
      
      net/ipv6/udp.c:1031:30: warning: no previous prototype for 'udp_v6_early_demux' [-Wmissing-prototypes]
       1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
            |                              ^~~~~~~~~~~~~~~~~~
      net/ipv6/udp.c:1072:29: warning: no previous prototype for 'udpv6_rcv' [-Wmissing-prototypes]
       1072 | INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
            |                             ^~~~~~~~~
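
      The fix is to give the two symbols visible prototypes in net/udp.h; the
      declarations likely take a shape along these lines (sketch):

      INDIRECT_CALLABLE_DECLARE(void udp_v6_early_demux(struct sk_buff *));
      INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));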
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f9a4719c
    • udp: call udp_encap_enable for v6 sockets when enabling encap · a4a600dd
      Committed by Xin Long
      When encap is enabled for an IPv6 socket without udp_encap_needed_key
      being increased, UDP GRO won't work for v4-mapped v6 address packets,
      as sk will be NULL in udp4_gro_receive().

      This patch enables it by increasing udp_encap_needed_key for v6 sockets
      in udp_tunnel_encap_enable(), and correspondingly decreasing
      udp_encap_needed_key in udpv6_destroy_sock().
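
      A minimal sketch of the enable/disable pair around the static key,
      assuming the jump-label API (the exported udp_encap_disable() mentioned
      in the notes below would look along these lines):

      void udp_encap_enable(void)
      {
              static_branch_inc(&udp_encap_needed_key);
      }
      EXPORT_SYMBOL(udp_encap_enable);

      void udp_encap_disable(void)
      {
              static_branch_dec(&udp_encap_needed_key);
      }
      EXPORT_SYMBOL(udp_encap_disable);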
      
      v1->v2:
        - add udp_encap_disable() and export it.
      v2->v3:
        - add the change for rxrpc and bareudp into one patch, as Alex
          suggested.
      v3->v4:
        - move rxrpc part to another patch.
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a4a600dd
  11. 02 February 2021 · 1 commit
  12. 11 November 2020 · 1 commit
  13. 25 July 2020 · 1 commit
  14. 25 June 2020 · 1 commit
  15. 24 June 2020 · 1 commit
    • udp: move gro declarations to net/udp.h · 6db69328
      Committed by Eric Dumazet
      This removes the following warnings:
        CC      net/ipv4/udp_offload.o
      net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
        504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
        584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      
        CHECK   net/ipv6/udp_offload.c
      net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
      net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/udp_offload.o
      net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
        115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
        148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6db69328
  16. 10 March 2020 · 1 commit
  17. 30 January 2020 · 1 commit
  18. 28 January 2020 · 1 commit
    • udp: segment looped gso packets correctly · 6cd021a5
      Committed by Willem de Bruijn
      Multicast and broadcast packets can be looped from egress to ingress
      pre-segmentation with dev_loopback_xmit. That function unconditionally
      sets ip_summed to CHECKSUM_UNNECESSARY.
      
      udp_rcv_segment segments gso packets in the udp rx path. Segmentation
      usually executes on egress, and does not expect packets of this type.
      __udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
      the offsets are not correct for gso_make_checksum.
      
      UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
      to the correct pseudo header checksum. Reset ip_summed to this type.
      (CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)
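
      A hedged sketch of the reset described above, as it could appear at the
      top of udp_rcv_segment():

      /* Looped packets arrive with ip_summed forced to CHECKSUM_UNNECESSARY
       * by dev_loopback_xmit(). UDP GSO packets are really CHECKSUM_PARTIAL
       * with uh->check holding the pseudo header checksum, so restore that
       * type before segmenting.
       */
      if (skb->pkt_type == PACKET_LOOPBACK)
              skb->ip_summed = CHECKSUM_PARTIAL;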
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Fixes: cf329aa4 ("udp: cope with UDP GRO packet misdirection")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cd021a5
  19. 27 January 2020 · 1 commit
  20. 31 August 2019 · 1 commit
  21. 31 May 2019 · 2 commits
  22. 09 April 2019 · 1 commit
  23. 09 November 2018 · 1 commit
    • net: Convert protocol error handlers from void to int · 32bbd879
      Committed by Stefano Brivio
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      patch.
      
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      
      This has no effect on existing error handling paths.
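
      The visible interface change is small; a simplified sketch of the error
      handler member in struct net_protocol after the conversion:

      struct net_protocol {
              int     (*handler)(struct sk_buff *skb);
              /* now returns an int, so the caller can keep looking for
               * another matching protocol when this handler does not match
               */
              int     (*err_handler)(struct sk_buff *skb, u32 info);
              /* ... */
      };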
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32bbd879
  24. 08 November 2018 · 2 commits
    • udp: cope with UDP GRO packet misdirection · cf329aa4
      Committed by Paolo Abeni
      In some scenarios, the GRO engine can assemble a UDP GRO packet that
      ultimately lands on a non GRO-enabled socket.
      This patch tries to address the issue by explicitly checking the UDP
      socket features before enqueuing the packet, and eventually segmenting
      the unexpected GRO packet, as needed.

      We must also cope with re-insertion requests: after segmentation the
      UDP code calls the helper introduced by the previous patches, as needed.

      Segmentation is performed by a common helper, which takes care of
      updating socket and protocol stats in case of failure.
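
      A hedged sketch of the receive-side check and segmentation loop
      described above (simplified; helper names follow those mentioned in
      this log where possible, and re-insertion/error handling is elided):

      static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
      {
              struct sk_buff *next, *segs;

              if (likely(!udp_unexpected_gso(sk, skb)))
                      return udp_queue_rcv_one_skb(sk, skb);

              /* a GRO packet landed on a socket that cannot consume it as
               * such: segment it and feed the segments to the regular path
               */
              segs = udp_rcv_segment(sk, skb, true);
              for (skb = segs; skb; skb = next) {
                      next = skb->next;
                      __skb_pull(skb, skb_transport_offset(skb));
                      udp_queue_rcv_one_skb(sk, skb);
              }
              return 0;
      }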
      
      rfc v3 -> v1
       - fix compile issues with rxrpc
       - when gso_segment returns NULL, treat it as an error
       - added 'ipv4' argument to udp_rcv_segment()
      
      rfc v2 -> rfc v3
       - moved udp_rcv_segment() into net/udp.h, account errors to socket
         and ns, always return NULL or segs list
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf329aa4
    • net: ensure unbound datagram socket to be chosen when not in a VRF · 6da5b0f0
      Committed by Mike Manning
      Ensure an unbound datagram skt is chosen when not in a VRF. The device
      match check in compute_score() for UDP must fail the candidate socket
      when there is no device match: a failure is returned in that case. This
      ensures that bound sockets are never selected when the device does not
      match, even if there is no unbound socket.
      
      Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
      packets are currently blocked, as flowi6_oif was set to that of the
      master vrf device, and the ipi6_ifindex is that of the slave device.
      Allow these packets to be sent by checking the device with ipi6_ifindex
      has the same L3 scope as that of the bound device of the skt, which is
      the master vrf device. Note that this check always succeeds if the skt
      is unbound.
      
      Even though the right datagram skt is now selected by compute_score(),
      a different skt is being returned that is bound to the wrong vrf. The
      difference between these and stream sockets is the handling of the skt
      option for SO_REUSEPORT. While the handling when adding a skt for reuse
      correctly checks that the bound device of the skt is a match, the skts
      in the hashslot are already incorrect. So for the same hash, a skt for
      the wrong vrf may be selected for the required port. The root cause is
      that the skt is immediately placed into a slot when it is created,
      but when the skt is then bound using SO_BINDTODEVICE, it remains in the
      same slot. The solution is to move the skt to the correct slot by
      forcing a rehash.
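
      A simplified sketch of the stricter device check in compute_score()
      described above (variable names are illustrative):

      if (sk->sk_bound_dev_if || exact_dif) {
              bool dev_match = (sk->sk_bound_dev_if == dif ||
                                sk->sk_bound_dev_if == sdif);

              /* previously a mismatch only skipped the score bonus; now it
               * disqualifies the bound socket so an unbound one can win
               */
              if (!dev_match)
                      return -1;
              if (sk->sk_bound_dev_if)
                      score += 4;
      }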
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6da5b0f0
  25. 06 October 2018 · 1 commit
  26. 29 June 2018 · 1 commit
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Committed by Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  27. 26 June 2018 · 1 commit
    • net: Convert GRO SKB handling to list_head. · d4546c25
      Committed by David Miller
      Manage pending per-NAPI GRO packets via list_head.
      
      Return an SKB pointer from the GRO receive handlers.  When GRO receive
      handlers return non-NULL, it means that this SKB needs to be completed
      at this time and removed from the NAPI queue.
      
      Several operations are greatly simplified by this transformation,
      especially timing out the oldest SKB in the list when gro_count
      exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
      in reverse order.
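
      A hedged sketch of the handler shape after the conversion (flow
      matching and coalescing elided):

      struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
      {
              struct sk_buff *p;

              list_for_each_entry(p, head, list) {
                      /* compare flows against skb; coalesce into p, or return
                       * p to have it completed and removed from the list now
                       */
              }

              /* NULL keeps skb pending on the per-NAPI list */
              return NULL;
      }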
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4546c25
  28. 09 June 2018 · 1 commit
    • udp: fix rx queue len reported by diag and proc interface · 6c206b20
      Committed by Paolo Abeni
      After commit 6b229cf7 ("udp: add batching to udp_rmem_release()")
      the sk_rmem_alloc field no longer measures the receive queue length
      exactly, because we batch the rmem release. The issue is really apparent
      only after commit 0d4a6608 ("udp: do rmem bulk free even if the rx sk
      queue is empty"): user space can easily observe an empty socket with a
      non-zero queue length reported by the 'ss' tool or the procfs interface.
      
      We need to use a custom UDP helper to report the correct queue length,
      taking into account the forward allocation deficit.
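
      The helper can be a one-liner; a minimal sketch, assuming the
      forward_deficit field of struct udp_sock:

      /* report the receive queue length net of the batched, not yet
       * released, forward allocation deficit
       */
      static int udp_rqueue_get(struct sock *sk)
      {
              return sk_rmem_alloc_get(sk) -
                     READ_ONCE(udp_sk(sk)->forward_deficit);
      }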
      
      Reported-by: trevor.francis@46labs.com
      Fixes: 6b229cf7 ("UDP: add batching to udp_rmem_release()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c206b20
  29. 26 May 2018 · 1 commit
  30. 16 May 2018 · 2 commits
  31. 09 May 2018 · 2 commits
  32. 27 April 2018 · 2 commits
    • udp: add gso segment cmsg · 2e8de857
      Committed by Willem de Bruijn
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
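
      A hedged user-space sketch of both ways to request UDP GSO: the
      UDP_SEGMENT socket option and the per-call cmsg added here (fd is
      assumed to be an already-connected UDP socket; error handling trimmed):

      #include <netinet/in.h>         /* SOL_UDP */
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>

      #ifndef UDP_SEGMENT
      #define UDP_SEGMENT 103         /* from linux/udp.h */
      #endif

      static void request_gso(int fd, struct msghdr *msg, uint16_t gso_size)
      {
              static char control[CMSG_SPACE(sizeof(uint16_t))];
              struct cmsghdr *cm;
              int val = gso_size;

              /* option A: per-socket segment size */
              setsockopt(fd, SOL_UDP, UDP_SEGMENT, &val, sizeof(val));

              /* option B: per-sendmsg segment size via cmsg, avoiding the
               * extra setsockopt() system call
               */
              msg->msg_control = control;
              msg->msg_controllen = sizeof(control);
              cm = CMSG_FIRSTHDR(msg);
              cm->cmsg_level = SOL_UDP;
              cm->cmsg_type = UDP_SEGMENT;
              cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
              memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));
      }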
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e8de857
    • udp: add udp gso · ee80d1eb
      Committed by Willem de Bruijn
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type, SKB_GSO_UDP_L4, to differentiate it
      from UFO (SKB_GSO_UDP).

      IPPROTO_UDPLITE is excluded, as that protocol has no GSO handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee80d1eb
  33. 31 March 2018 · 1 commit
    • bpf: Hooks for sys_connect · d74bad4e
      Committed by Andrey Ignatov
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 2nd
      part of the problem: making an outgoing connection from a desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      The local end of the connection can be bound to a desired IP using the
      newly introduced BPF helper `bpf_bind()`. It only allows binding to an
      IP though, and doesn't support binding to a port, i.e. it leverages the
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for the port.
      
      As for the remote end (`struct sockaddr *` passed by user), both parts
      of it can be overridden, remote IP and remote port. It's useful if an
      application inside a cgroup wants to connect to another application
      inside the same cgroup or to itself, but knows nothing about the IP
      assigned to the cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for the same reason as the
      sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
      fields when the user passes a sockaddr_in, since it'd be out-of-bounds.
      
      == Implementation notes ==
      
      The patch introduces a new field in `struct proto`: `pre_connect`, a
      pointer to a function with the same signature as `connect` that is
      called before it. The reason is that in some cases BPF hooks should be
      called well before control is passed to `sk->sk_prot->connect`.
      Specifically, `inet_dgram_connect` autobinds the socket before calling
      `sk->sk_prot->connect`, and there is no way to call `bpf_bind()` from
      hooks in e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause a double bind. On the other hand, `proto.pre_connect`
      provides a flexible way to add BPF hooks for connect only for the
      protocols that need them, and to call them at the desired time before
      `connect`. Since `bpf_bind()` is allowed to bind only to an IP and the
      autobind in `inet_dgram_connect` binds only the port, there is no
      chance of a double bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind only to an IP,
      regardless of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling __inet_bind()
      and __inet6_bind(), since all call sites where bpf_bind() is called
      already hold the socket lock.
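
      A hedged sketch of a cgroup/connect4 program using the new attach type
      and bpf_bind(); the libbpf-style includes and the addresses are
      illustrative assumptions:

      #include <linux/bpf.h>
      #include <linux/in.h>
      #include <sys/socket.h>
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_endian.h>

      SEC("cgroup/connect4")
      int rewrite_connect4(struct bpf_sock_addr *ctx)
      {
              struct sockaddr_in sa = {};

              /* bind the local end to a desired source IP; the port is left
               * to autobind (IP_BIND_ADDRESS_NO_PORT semantics)
               */
              sa.sin_family = AF_INET;
              sa.sin_addr.s_addr = bpf_htonl(0x0a000001); /* 10.0.0.1 */
              if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
                      return 0;       /* reject the connect() on failure */

              /* override the remote end seen by connect(2) */
              ctx->user_ip4 = bpf_htonl(0x0a000002);      /* 10.0.0.2 */
              ctx->user_port = bpf_htons(4000);

              return 1;               /* allow */
      }

      char _license[] SEC("license") = "GPL";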
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d74bad4e