1. 11 Mar, 2022 1 commit
  2. 04 Mar, 2022 7 commits
  3. 03 Mar, 2022 7 commits
    • M
      bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress · 7449197d
      Martin KaFai Lau committed
      The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
      as a (rcv) timestamp which currently could either be 0 (not available)
      or ktime_get_real().  This patch stays backward compatible with the
      (rcv) timestamp expectation at ingress.  If the skb->tstamp has
      the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
      running at ingress as it is not available.  When writing at ingress,
      it will also clear the skb->mono_delivery_time bit.
      
      /* BPF_READ: a = __sk_buff->tstamp */
      if (!skb->tc_at_ingress || !skb->mono_delivery_time)
      	a = skb->tstamp;
      else
      	a = 0;
      
      /* BPF_WRITE: __sk_buff->tstamp = a */
      if (skb->tc_at_ingress)
      	skb->mono_delivery_time = 0;
      skb->tstamp = a;
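
The two rewrite rules above can be exercised in userspace with a minimal sketch. Everything here is an assumption made for illustration: `struct skb_model` and the `tc_*_model()` functions are hypothetical stand-ins, not kernel code, and the real rewrite is emitted as BPF instructions rather than C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three sk_buff fields used by the rewrite. */
struct skb_model {
	int64_t tstamp;          /* (rcv) timestamp or delivery_time, in ns */
	bool mono_delivery_time; /* tstamp holds a mono delivery_time */
	bool tc_at_ingress;      /* tc-bpf is running at ingress */
};

/* BPF_READ: a = __sk_buff->tstamp */
int64_t tc_read_tstamp_model(const struct skb_model *skb)
{
	assert(skb);
	if (!skb->tc_at_ingress || !skb->mono_delivery_time)
		return skb->tstamp;
	return 0; /* delivery_time is hidden from tc-bpf@ingress */
}

/* BPF_WRITE: __sk_buff->tstamp = a */
void tc_write_tstamp_model(struct skb_model *skb, int64_t a)
{
	assert(skb);
	if (skb->tc_at_ingress)
		skb->mono_delivery_time = false;
	skb->tstamp = a;
}
```

In this model, an ingress reader sees 0 while the delivery_time is stored, and any ingress write turns the field back into a plain (rcv) timestamp.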
      
      [ A note on the BPF_CGROUP_INET_INGRESS which can also access
        skb->tstamp.  At that point, the skb is delivered locally
        and skb_clear_delivery_time() has already been done,
        so the skb->tstamp will only have the (rcv) timestamp. ]
      
      If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
      has to be cleared also.  It could be done together during
      convert_ctx_access().  However, a later patch will also expose
      the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
      Changing the delivery_time_type in the background may surprise
      the user, e.g. the 2nd read on __sk_buff->delivery_time_type
      may need a READ_ONCE() to avoid compiler optimization.  Thus,
      in anticipation of the needs of that later patch, this patch does a
      check on !skb->tstamp after running the tc-bpf and clears the
      skb->mono_delivery_time bit if needed.  See the earlier discussion
      on v4 [0].
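
The post-run check described above can be sketched as follows. This is a userspace model under stated assumptions: `struct skb_model`, `tc_run_model()` and the two example programs are hypothetical names, and the program is represented by a plain C function pointer instead of a real BPF program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

typedef int (*tc_prog_fn)(struct skb_model *skb);

/* Run the program, then fix up the bit once instead of on every write. */
int tc_run_model(struct skb_model *skb, tc_prog_fn prog)
{
	assert(skb && prog);
	int ret = prog(skb);

	if (!skb->tstamp)
		skb->mono_delivery_time = false;
	return ret;
}

/* Example program: writes 0 to the timestamp. */
int prog_zero_tstamp(struct skb_model *skb)
{
	skb->tstamp = 0;
	return 0;
}

/* Example program: leaves the skb untouched. */
int prog_noop(struct skb_model *skb)
{
	(void)skb;
	return 0;
}
```

Doing the fixup after the program returns means the bit never changes while the program is still running, which is the property the message wants for delivery_time_type.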
      
      The bpf insn rewrite requires the skb's mono_delivery_time bit and
      tc_at_ingress bit.  They are moved up in sk_buff so that bpf rewrite
      can be done at a fixed offset.  tc_skip_classify is moved together with
      tc_at_ingress.  To get one bit for mono_delivery_time, csum_not_inet is
      moved down and this bit is currently used by sctp.
      
      [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7449197d
    • M
      net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option · b6561f84
      Martin KaFai Lau committed
      IOAM is a hop-by-hop option with a temporary IANA allocation (49).
      Since it is hop-by-hop, it is done before the input routing decision.
      One of the traced data fields is the (rcv) timestamp.
      
      When the locally generated skb is looping from egress to ingress over
      a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
      delivery time before it is known that it will be delivered locally
      and received by another sk.
      
      Like handling the network tapping (tcpdump) in the earlier patch,
      this patch gets the timestamp if needed without over-writing the
      delivery_time in the skb->tstamp.  skb_tstamp_cond() is added to do the
      ktime_get_real() with an extra cond arg to check on top of the
      netstamp_needed_key static key.  skb_tstamp_cond() will also be used in
      a later patch and it needs the netstamp_needed_key check.
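
A minimal userspace sketch of the idea behind skb_tstamp_cond() follows. It is an assumption-laden model, not the kernel helper: `struct skb_model`, `netstamp_needed_model` (standing in for the static key) and `real_now_ns()` (standing in for ktime_get_real()) are hypothetical, and returning an already present (rcv) timestamp first is this sketch's own choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* Stand-in for the netstamp_needed_key static key. */
bool netstamp_needed_model;

/* Stand-in for ktime_get_real(): wall-clock time in nanoseconds. */
int64_t real_now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Return a (rcv) timestamp without ever writing to skb->tstamp. */
int64_t tstamp_cond_model(const struct skb_model *skb, bool cond)
{
	assert(skb);
	if (!skb->mono_delivery_time && skb->tstamp)
		return skb->tstamp;       /* already a (rcv) timestamp */
	if (netstamp_needed_model || cond)
		return real_now_ns();     /* take a fresh one on demand */
	return 0;                         /* nobody asked for a timestamp */
}
```

Because the skb is passed as `const`, the stored delivery_time cannot be overwritten by this path.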
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6561f84
    • M
      net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Martin KaFai Lau committed
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set the skb->mono_delivery_time to flag that skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
      and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This lets bpf_redirect_*() keep the
      mono delivery_time so that it can be used by a qdisc (fq) of
      the egress interface.
      
      A later patch will postpone the skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally and that will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time also.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited to CONFIG_NET_INGRESS, to avoid too much code
      churn within this set.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d98d58a0
    • M
      net: Clear mono_delivery_time bit in __skb_tstamp_tx() · d93376f5
      Martin KaFai Lau committed
      __skb_tstamp_tx() may clone the egress skb and queue the clone to
      the sk_error_queue.  The outgoing skb may have the mono delivery_time
      while the (rcv) timestamp is expected for the clone, so the
      skb->mono_delivery_time bit needs to be cleared from the clone.
      
      This patch adds the skb->mono_delivery_time clearing to the existing
      __net_timestamp() and uses it in __skb_tstamp_tx().
      The __net_timestamp() fast path usage in dev.c is changed to directly
      call ktime_get_real() since the mono_delivery_time bit is not set at
      that point.
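
The clone handling can be sketched in userspace as below. The names `struct skb_model`, `net_timestamp_model()` and `tstamp_tx_clone_model()` are hypothetical, and the current time is passed in as a parameter rather than read from a clock, so this is a model of the described behavior and not the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* Model of __net_timestamp(): store a (rcv) timestamp and clear the bit. */
void net_timestamp_model(struct skb_model *skb, int64_t now_ns)
{
	assert(skb);
	skb->tstamp = now_ns;
	skb->mono_delivery_time = false;
}

/* Model of the error-queue path: stamp a copy, leave the original alone. */
struct skb_model tstamp_tx_clone_model(const struct skb_model *orig,
				       int64_t now_ns)
{
	struct skb_model clone = *orig;

	net_timestamp_model(&clone, now_ns);
	return clone;
}
```

The outgoing skb keeps its delivery_time for the qdisc, while the clone queued for the socket carries only a (rcv) timestamp.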
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d93376f5
    • M
      net: Handle delivery_time in skb->tstamp during network tapping with af_packet · 27942a15
      Martin KaFai Lau committed
      A later patch will set the skb->mono_delivery_time to flag that skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
      skb_clear_tstamp() will then keep this delivery_time during forwarding.
      
      This patch makes network tapping (with af_packet) handle
      the delivery_time stored in skb->tstamp.
      
      Regardless of tapping at ingress or egress, the tapped skb is
      received by the af_packet socket, so it is ingress to the af_packet
      socket and it expects the (rcv) timestamp.
      
      When tapping at egress, dev_queue_xmit_nit() is used.  It already
      expects that skb->tstamp may hold the delivery_time, so it does
      skb_clone()+net_timestamp_set() to ensure the cloned skb has
      the (rcv) timestamp before passing it to the af_packet sk.
      This patch only adds clearing of the skb->mono_delivery_time
      bit in net_timestamp_set().
      
      When tapping at ingress, the code currently expects skb->tstamp to be
      either 0 or the (rcv) timestamp.  That is, the ingress tapping path
      already expects that skb->tstamp could be 0 and gets
      the (rcv) timestamp by ktime_get_real() when needed.
      
      There are two cases for tapping at ingress:
      
      In one case, af_packet queues the skb to its sk_receive_queue.
      The skb is either not shared or is a newly created clone.  The newly
      added skb_clear_delivery_time() is called to clear the
      delivery_time (if any) and set the (rcv) timestamp if
      needed before the skb is queued to the sk_receive_queue.
      
      In the other case, the ingress skb is directly copied to the rx_ring
      and tpacket_get_timestamp() is used to get the (rcv) timestamp.
      The newly added skb_tstamp() is used in tpacket_get_timestamp()
      to check the skb->mono_delivery_time bit before returning skb->tstamp.
      As mentioned earlier, tapping@ingress already expects that
      the skb may not have the (rcv) timestamp (because no sk has asked
      for it) and handles this case by directly calling ktime_get_real().
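
The two ingress cases can be modeled in userspace as below. This is a sketch under assumptions: `struct skb_model` and both functions are hypothetical, and whether a timestamp is needed and what the current time is are passed in as parameters instead of coming from netstamp_needed_key and ktime_get_real().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* Queue case: drop the delivery_time, optionally set a (rcv) timestamp. */
void clear_delivery_time_model(struct skb_model *skb, bool stamp_needed,
			       int64_t now_ns)
{
	assert(skb);
	if (!skb->mono_delivery_time)
		return;                   /* already 0 or a (rcv) timestamp */
	skb->mono_delivery_time = false;
	skb->tstamp = stamp_needed ? now_ns : 0;
}

/* rx_ring case: read-only view that never exposes a delivery_time. */
int64_t tstamp_model(const struct skb_model *skb)
{
	assert(skb);
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}
```

A caller of `tstamp_model()` that receives 0 would fall back to reading the clock itself, matching the existing handling of a missing (rcv) timestamp.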
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27942a15
    • M
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau committed
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds the skb_clear_tstamp() helper, which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a later patch.
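
The forwarding rule can be sketched in userspace as follows, assuming a hypothetical `struct skb_model` and `clear_tstamp_model()` in place of the real sk_buff and helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* On forward: keep a mono delivery_time, reset anything else to 0. */
void clear_tstamp_model(struct skb_model *skb)
{
	assert(skb);
	if (skb->mono_delivery_time)
		return;
	skb->tstamp = 0;
}
```

A forwarded TCP skb therefore still carries its EDT when it reaches fq, while a stale (rcv) timestamp is dropped as before.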
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de799101
    • M
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau committed
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell whether skb->tstamp holds the (rcv)
      timestamp or the delivery_time, so it is always reset to 0 whenever
      forwarded between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egresses to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          reset
      
      This patch adds one bit, skb->mono_delivery_time, to flag that skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) for
      use with sch_fq.  A later patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      The skb_set_delivery_time() function is added.  It is used by tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a
      placeholder.  It will be enabled in a later patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1ac9c8a
  4. 26 Feb, 2022 2 commits
  5. 23 Feb, 2022 2 commits
    • E
      net: preserve skb_end_offset() in skb_unclone_keeptruesize() · 2b88cba5
      Eric Dumazet committed
      syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len)
      in skb_try_coalesce() [1]
      
      I was able to root cause the issue to kfence.
      
      When kfence is in action, the following assertion is no longer true:
      
      int size = xxxx;
      void *ptr1 = kmalloc(size, gfp);
      void *ptr2 = kmalloc(size, gfp);
      
      if (ptr1 && ptr2)
      	ASSERT(ksize(ptr1) == ksize(ptr2));
      
      We attempted to fix these issues in the blamed commits, but forgot
      that TCP was possibly shifting data after skb_unclone_keeptruesize()
      has been used, notably from tcp_retrans_try_collapse().
      
      So we not only need to keep the same skb->truesize value,
      we also need to make sure TCP won't fill new tailroom
      that pskb_expand_head() was able to get from an
      addr = kmalloc(...) followed by ksize(addr)
      
      Split skb_unclone_keeptruesize() into two parts:
      
      1) Inline skb_unclone_keeptruesize() for the common case,
         when skb is not cloned.
      
      2) Out of line __skb_unclone_keeptruesize() for the 'slow path'.
      
      WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Modules linked in:
      CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa
      RSP: 0018:ffffc900063af268 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000
      RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003
      RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000
      R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8
      R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155
      FS:  00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline]
       tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630
       tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914
       tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025
       tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947
       tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719
       sk_backlog_rcv include/net/sock.h:1037 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2779
       release_sock+0x54/0x1b0 net/core/sock.c:3311
       sk_wait_data+0x177/0x450 net/core/sock.c:2821
       tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457
       tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572
       inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c4777efa ("net: add and use skb_unclone_keeptruesize() helper")
      Fixes: 097b9146 ("net: fix up truesize of cloned skb in skb_prepare_for_shift()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b88cba5
    • E
      net: add skb_set_end_offset() helper · 763087da
      Eric Dumazet committed
      We have multiple places where this helper is convenient,
      and plan to use it in the following patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      763087da
  6. 20 Feb, 2022 5 commits
  7. 07 Feb, 2022 6 commits
  8. 28 Jan, 2022 1 commit
  9. 27 Jan, 2022 1 commit
  10. 22 Jan, 2022 1 commit
  11. 10 Jan, 2022 4 commits
  12. 18 Dec, 2021 2 commits
  13. 10 Dec, 2021 1 commit
    • K
      skbuff: Extract list pointers to silence compiler warnings · 1a2fb220
      Kees Cook committed
      Under both -Warray-bounds and the object_size sanitizer, the compiler is
      upset about accessing prev/next of sk_buff when the object it thinks it
      is coming from is sk_buff_head. The warning is a false positive due to
      the compiler taking a conservative approach, opting to warn at casting
      time rather than access time.
      
      However, in support of enabling -Warray-bounds globally (which has
      found many real bugs), arrange things for sk_buff so that the compiler
      can unambiguously see that there is no intention to access anything
      except prev/next.  Introduce and cast to a separate struct sk_buff_list,
      which contains _only_ the first two fields, silencing the warnings:
      
      In file included from ./include/net/net_namespace.h:39,
                       from ./include/linux/netdevice.h:37,
                       from net/core/netpoll.c:17:
      net/core/netpoll.c: In function 'refill_skbs':
      ./include/linux/skbuff.h:2086:9: warning: array subscript 'struct sk_buff[0]' is partly outside array bounds of 'struct sk_buff_head[1]' [-Warray-bounds]
       2086 |         __skb_insert(newsk, next->prev, next, list);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      net/core/netpoll.c:49:28: note: while referencing 'skb_pool'
         49 | static struct sk_buff_head skb_pool;
            |                            ^~~~~~~~
      
      This change results in no executable instruction differences.
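
The layout idea can be illustrated with a small userspace sketch. All type and function names here are hypothetical models, not the kernel definitions, and the sketch embeds the two-pointer struct as the first member of both types, which is one way to make the cast well defined in portable C.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff_model;

/* Only the two link pointers, mirroring the role of struct sk_buff_list. */
struct sk_buff_list_model {
	struct sk_buff_model *next;
	struct sk_buff_model *prev;
};

struct sk_buff_model {
	struct sk_buff_list_model list; /* must stay the first member */
	char payload[64];
};

struct sk_buff_head_model {
	struct sk_buff_list_model list; /* must stay the first member */
	unsigned int qlen;
};

void queue_init_model(struct sk_buff_head_model *q)
{
	assert(q);
	q->list.next = q->list.prev = (struct sk_buff_model *)q;
	q->qlen = 0;
}

/* Append newsk; the head is only ever touched through the list view. */
void queue_tail_model(struct sk_buff_head_model *q,
		      struct sk_buff_model *newsk)
{
	assert(q && newsk);
	struct sk_buff_model *next = (struct sk_buff_model *)q;
	struct sk_buff_model *prev = ((struct sk_buff_list_model *)next)->prev;

	newsk->list.next = next;
	newsk->list.prev = prev;
	((struct sk_buff_list_model *)next)->prev = newsk;
	((struct sk_buff_list_model *)prev)->next = newsk;
	q->qlen++;
}
```

Every access that might land on the queue head goes through the two-pointer type, so the compiler has no reason to assume that anything beyond prev/next is read or written.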
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20211207062758.2324338-1-keescook@chromium.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1a2fb220