1. 10 8月, 2022 1 次提交
    • M
      ipv6: do not use RT_TOS for IPv6 flowlabel · ab7e2e0d
      Matthias May 提交于
      According to Guillaume Nault RT_TOS should never be used for IPv6.
      
      Quote:
      RT_TOS() is an old macro used to interprete IPv4 TOS as described in
      the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
      code, although, given the current state of the code, most of the
      existing calls have no consequence.
      
      But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
      field to be interpreted the RFC 1349 way. There's no historical
      compatibility to worry about.
      
      Fixes: 571912c6 ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
      Acked-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NMatthias May <matthias.may@westermo.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      ab7e2e0d
  2. 20 7月, 2022 1 次提交
  3. 19 7月, 2022 1 次提交
  4. 09 6月, 2022 1 次提交
    • W
      ipv6: Fix signed integer overflow in __ip6_append_data · f93431c8
      Wang Yufen 提交于
      Resurrect ubsan overflow checks and ubsan report this warning,
      fix it by change the variable [length] type to size_t.
      
      UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
      2147479552 + 8567 cannot be represented in type 'int'
      CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
      Hardware name: linux,dummy-virt (DT)
      Call trace:
        dump_backtrace+0x214/0x230
        show_stack+0x30/0x78
        dump_stack_lvl+0xf8/0x118
        dump_stack+0x18/0x30
        ubsan_epilogue+0x18/0x60
        handle_overflow+0xd0/0xf0
        __ubsan_handle_add_overflow+0x34/0x44
        __ip6_append_data.isra.48+0x1598/0x1688
        ip6_append_data+0x128/0x260
        udpv6_sendmsg+0x680/0xdd0
        inet6_sendmsg+0x54/0x90
        sock_sendmsg+0x70/0x88
        ____sys_sendmsg+0xe8/0x368
        ___sys_sendmsg+0x98/0xe0
        __sys_sendmmsg+0xf4/0x3b8
        __arm64_sys_sendmmsg+0x34/0x48
        invoke_syscall+0x64/0x160
        el0_svc_common.constprop.4+0x124/0x300
        do_el0_svc+0x44/0xc8
        el0_svc+0x3c/0x1e8
        el0t_64_sync_handler+0x88/0xb0
        el0t_64_sync+0x16c/0x170
      
      Changes since v1:
      -Change the variable [length] type to unsigned, as Eric Dumazet suggested.
      Changes since v2:
      -Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
      Changes since v3:
      -Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
      Jakub Kicinski suggested.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NWang Yufen <wangyufen@huawei.com>
      Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      f93431c8
  5. 16 5月, 2022 1 次提交
  6. 30 4月, 2022 2 次提交
  7. 13 4月, 2022 1 次提交
  8. 11 4月, 2022 1 次提交
  9. 16 3月, 2022 1 次提交
    • D
      net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74
      David Ahern 提交于
      The fundamental premise of VRF and l3mdev core code is binding a socket
      to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
      Legacy code resets flowi_oif to the l3mdev losing any original port
      device binding. Ben (among others) has demonstrated use cases where the
      original port device binding is important and needs to be retained.
      This patch handles that by adding a new entry to the common flow struct
      that can indicate the l3mdev index for later rule and table matching
      avoiding the need to reset flowi_oif.
      
      In addition to allowing more use cases that require port device binds,
      this patch brings a few datapath simplications:
      
      1. l3mdev_fib_rule_match is only called when walking fib rules and
         always after l3mdev_update_flow. That allows an optimization to bail
         early for non-VRF type uses cases when flowi_l3mdev is not set. Also,
         only that index needs to be checked for the FIB table id.
      
      2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
         (e.g., VRF) device. By resetting flowi_oif only for this case the
         FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed,
         removing several checks in the datapath. The flowi_iif path can be
         simplified to only be called if the it is not loopback (loopback can
         not be assigned to an L3 domain) and the l3mdev index is not already
         set.
      
      3. Avoid another device lookup in the output path when the fib lookup
         returns a reject failure.
      
      Note: 2 functional tests for local traffic with reject fib rules are
      updated to reflect the new direct failure at FIB lookup time for ping
      rather than the failure on packet path. The current code fails like this:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: Warning: source address might be selected on device other than: eth1
          PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.
      
          --- 172.16.3.1 ping statistics ---
          1 packets transmitted, 0 received, 100% packet loss, time 0ms
      
      where the test now directly fails:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: connect: No route to host
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Tested-by: NBen Greear <greearb@candelatech.com>
      Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      40867d74
  10. 12 3月, 2022 1 次提交
  11. 03 3月, 2022 2 次提交
    • M
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau 提交于
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds skb_clear_tstamp() helper which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a latter patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de799101
    • M
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau 提交于
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          rest
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A latter patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a place
      holder.  It will be enabled in a latter patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdfSigned-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1ac9c8a
  12. 26 2月, 2022 1 次提交
  13. 18 2月, 2022 1 次提交
    • E
      net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Eric Dumazet 提交于
      UDP sendmsg() can be lockless, this is causing all kinds
      of data races.
      
      This patch converts sk->sk_tskey to remove one of these races.
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1cdec57
  14. 17 2月, 2022 1 次提交
  15. 28 1月, 2022 7 次提交
  16. 24 1月, 2022 1 次提交
    • J
      xfrm: fix MTU regression · 6596a022
      Jiri Bohac 提交于
      Commit 749439bf ("ipv6: fix udpv6
      sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
      
      A Packet Too Big ICMPv6 message received in response to an ESP
      packet will prevent all further communication through the tunnel
      if the reported MTU minus the ESP overhead is smaller than 1280.
      
      E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
      is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
      in all further packets in the tunnel dropped. A ping through the
      tunnel fails with "ping: sendmsg: Invalid argument".
      
      Apparently the MTU on the xfrm route is smaller than 1280 and
      fails the check inside ip6_setup_cork() added by 749439bf.
      
      We found this by debugging USGv6/ipv6ready failures. Failing
      tests are: "Phase-2 Interoperability Test Scenario IPsec" /
      5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).
      
      Commit b515d263 ("xfrm:
      xfrm_state_mtu should return at least 1280 for ipv6") attempted
      to fix this but caused another regression in TCP MSS calculations
      and had to be reverted.
      
      The patch below fixes the situation by dropping the MTU
      check and instead checking for the underflows described in the
      749439bf commit message.
      Signed-off-by: NJiri Bohac <jbohac@suse.cz>
      Fixes: 749439bf ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      6596a022
  17. 22 11月, 2021 1 次提交
  18. 16 11月, 2021 1 次提交
  19. 16 10月, 2021 1 次提交
  20. 03 8月, 2021 2 次提交
  21. 23 7月, 2021 1 次提交
  22. 21 7月, 2021 1 次提交
  23. 20 7月, 2021 1 次提交
  24. 13 7月, 2021 1 次提交
    • V
      ipv6: allocate enough headroom in ip6_finish_output2() · 5796015f
      Vasily Averin 提交于
      When TEE target mirrors traffic to another interface, sk_buff may
      not have enough headroom to be processed correctly.
      ip_finish_output2() detect this situation for ipv4 and allocates
      new skb with enogh headroom. However ipv6 lacks this logic in
      ip_finish_output2 and it leads to skb_under_panic:
      
       skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
       head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
       ------------[ cut here ]------------
       kernel BUG at net/core/skbuff.c:110!
       invalid opcode: 0000 [#1] SMP PTI
       CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
       Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
       Workqueue: ipv6_addrconf addrconf_dad_work
       RIP: 0010:skb_panic+0x48/0x4a
       Call Trace:
        skb_push.cold.111+0x10/0x10
        ipgre_header+0x24/0xf0 [ip_gre]
        neigh_connected_output+0xae/0xf0
        ip6_finish_output2+0x1a8/0x5a0
        ip6_output+0x5c/0x110
        nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
        tee_tg6+0x2e/0x40 [xt_TEE]
        ip6t_do_table+0x294/0x470 [ip6_tables]
        nf_hook_slow+0x44/0xc0
        nf_hook.constprop.34+0x72/0xe0
        ndisc_send_skb+0x20d/0x2e0
        ndisc_send_ns+0xd1/0x210
        addrconf_dad_work+0x3c8/0x540
        process_one_work+0x1d1/0x370
        worker_thread+0x30/0x390
        kthread+0x116/0x130
        ret_from_fork+0x22/0x30
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5796015f
  25. 07 7月, 2021 1 次提交
    • N
      ipv6: fix 'disable_policy' for fwd packets · ccd27f05
      Nicolas Dichtel 提交于
      The goal of commit df789fe7 ("ipv6: Provide ipv6 version of
      "disable_policy" sysctl") was to have the disable_policy from ipv4
      available on ipv6.
      However, it's not exactly the same mechanism. On IPv4, all packets coming
      from an interface, which has disable_policy set, bypass the policy check.
      For ipv6, this is done only for local packets, ie for packets destinated to
      an address configured on the incoming interface.
      
      Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same
      effect for both protocols.
      
      My first approach was to create a new kind of route cache entries, to be
      able to set DST_NOPOLICY without modifying routes. This would have added a
      lot of code. Because the local delivery path is already handled, I choose
      to focus on the forwarding path to minimize code churn.
      
      Fixes: df789fe7 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ccd27f05
  26. 25 6月, 2021 2 次提交
  27. 04 2月, 2021 1 次提交
  28. 10 1月, 2021 1 次提交
    • A
      net: ipv6: Validate GSO SKB before finish IPv6 processing · b210de4f
      Aya Levin 提交于
      There are cases where GSO segment's length exceeds the egress MTU:
       - Forwarding of a TCP GRO skb, when DF flag is not set.
       - Forwarding of an skb that arrived on a virtualisation interface
         (virtio-net/vhost/tap) with TSO/GSO size set by other network
         stack.
       - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
         interface with a smaller MTU.
       - Arriving GRO skb (or GSO skb in a virtualised environment) that is
         bridged to a NETIF_F_TSO tunnel stacked over an interface with an
         insufficient MTU.
      
      If so:
       - Consume the SKB and its segments.
       - Issue an ICMP packet with 'Packet Too Big' message containing the
         MTU, allowing the source host to reduce its Path MTU appropriately.
      
      Note: These cases are handled in the same manner in IPv4 output finish.
      This patch aligns the behavior of IPv6 and the one of IPv4.
      
      Fixes: 9e508490 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
      Signed-off-by: NAya Levin <ayal@nvidia.com>
      Reviewed-by: NTariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      b210de4f
  29. 08 1月, 2021 2 次提交