1. 29 Sep 2022 (1 commit)
  2. 07 Sep 2022 (1 commit)
    • M
      net: skb: export skb drop reasons to user by TRACE_DEFINE_ENUM · 9cb252c4
      Authored by Menglong Dong
      As Eric reported, the 'reason' field is not presented when tracing the
      kfree_skb event with perf:
      
      $ perf record -e skb:kfree_skb -a sleep 10
      $ perf script
        ip_defrag 14605 [021]   221.614303:   skb:kfree_skb:
        skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
        reason:
      
      The cause seems to be passing the kernel address directly to TP_printk(),
      which is not right. As the enum 'skb_drop_reason' is not exported to
      user space through TRACE_DEFINE_ENUM(), perf can't resolve the drop reason
      string from the 'reason' field, which is just a number.
      
      Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
      to define the trace enum entries via TRACE_DEFINE_ENUM(). With the help
      of DEFINE_DROP_REASON(), we can now remove the auto-generation that we
      introduced in commit ec43908d
      ("net: skb: use auto-generation to convert skb drop reason to string"),
      and define the string array 'drop_reasons' directly.
      
      Hmmm... now we are back to the situation of having to maintain drop
      reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they
      are both in dropreason.h, which makes it easier.
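      
      The pattern is roughly the following (a heavily abbreviated sketch; the
      full reason list lives in dropreason.h and the exact expansion in
      trace/events/skb.h may differ):
      
      /* dropreason.h: one X-macro list as the single source of truth */
      #define DEFINE_DROP_REASON(FN, FNe)    \
              FN(NOT_SPECIFIED)              \
              FN(NO_SOCKET)                  \
              FNe(MAX)
      
      /* trace/events/skb.h: expand once to export the enum values ... */
      #undef FN
      #define FN(reason) TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
      DEFINE_DROP_REASON(FN, FN)
      
      /* ... and once more to build the __print_symbolic() table */
      #undef FN
      #define FN(reason)  { SKB_DROP_REASON_##reason, #reason },
      #define FNe(reason) { SKB_DROP_REASON_##reason, #reason }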
      
      After this commit, the format of kfree_skb looks like this:
      
      $ cat /tracing/events/skb/kfree_skb/format
      name: kfree_skb
      ID: 1524
      format:
              field:unsigned short common_type;       offset:0;       size:2; signed:0;
              field:unsigned char common_flags;       offset:2;       size:1; signed:0;
              field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
              field:int common_pid;   offset:4;       size:4; signed:1;
      
              field:void * skbaddr;   offset:8;       size:8; signed:0;
              field:void * location;  offset:16;      size:8; signed:0;
              field:unsigned short protocol;  offset:24;      size:2; signed:0;
              field:enum skb_drop_reason reason;      offset:28;      size:4; signed:0;
      
      print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......
      
      Fixes: ec43908d ("net: skb: use auto-generation to convert skb drop reason to string")
      Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cb252c4
  3. 24 Aug 2022 (3 commits)
  4. 20 Jul 2022 (2 commits)
  5. 19 Jul 2022 (2 commits)
  6. 08 Jul 2022 (1 commit)
    • E
      net: minor optimization in __alloc_skb() · c2dd4059
      Authored by Eric Dumazet
      TCP allocates 'fast clones' skbs for packets in tx queues.
      
      Currently, __alloc_skb() initializes the companion fclone
      field to SKB_FCLONE_CLONE, and leaves other fields untouched.
      
      It makes sense to defer this init until much later, in skb_clone(),
      because all fclone fields are copied and hot in cpu caches
      at that time.
      
      This removes one cache line miss in __alloc_skb(), a cost seen
      on a host with 256 cpus all competing on memory accesses.
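      
      Roughly, the deferred init lands in the fast-clone branch of
      skb_clone(), along these lines (a sketch; names follow
      net/core/skbuff.c, the exact context may differ):
      
              if (skb->fclone == SKB_FCLONE_ORIG &&
                  refcount_read(&fclones->fclone_ref) == 1) {
                      n = &fclones->skb2;
                      refcount_set(&fclones->fclone_ref, 2);
                      /* previously done at allocation time in __alloc_skb() */
                      n->fclone = SKB_FCLONE_CLONE;
              }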
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2dd4059
  7. 25 Jun 2022 (1 commit)
  8. 15 Jun 2022 (1 commit)
  9. 10 Jun 2022 (2 commits)
  10. 07 Jun 2022 (1 commit)
    • M
      net: skb: use auto-generation to convert skb drop reason to string · ec43908d
      Authored by Menglong Dong
      It is annoying to add new skb drop reasons to 'enum skb_drop_reason'
      and TRACE_SKB_DROP_REASON in trace/event/skb.h, and it's easy to forget
      to add a new reason to TRACE_SKB_DROP_REASON.
      
      TRACE_SKB_DROP_REASON is used to convert a drop reason from a number to
      a string. For now, the string we pass to user space is exactly the
      name in 'enum skb_drop_reason' with the 'SKB_DROP_REASON_' prefix
      stripped. Therefore, we can use auto-generation to produce these
      strings at build time.
      
      The new source file 'dropreason_str.c' is auto-generated at build
      time and contains the string array
      'const char * const drop_reasons[]'.
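      
      The generated array looks roughly like this (an illustrative sketch,
      not the verbatim generated output):
      
      const char * const drop_reasons[] = {
              [SKB_DROP_REASON_NOT_SPECIFIED] = "NOT_SPECIFIED",
              [SKB_DROP_REASON_NO_SOCKET]     = "NO_SOCKET",
              /* ... one entry per enum skb_drop_reason value ... */
      };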
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ec43908d
  11. 21 May 2022 (1 commit)
  12. 16 May 2022 (3 commits)
    • E
      net: add skb_defer_max sysctl · 39564c3f
      Authored by Eric Dumazet
      commit 68822bdf ("net: generalize skb freeing
      deferral to per-cpu lists") added another per-cpu
      cache of skbs. It was expected to be small,
      and an IPI was forced whenever the list reached 128
      skbs.
      
      We might need to control queue capacity and added
      latency more precisely.
      
      An IPI is now generated whenever the queue reaches half capacity.
      
      The default value of the new limit is 64.
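      
      Usage is then, for example (the sysctl name matches the subject; the
      net.core path and the value 128 are illustrative):
      
      $ sudo sysctl -w net.core.skb_defer_max=128   # IPI fires at half capacity (64)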
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39564c3f
    • E
      net: fix possible race in skb_attempt_defer_free() · 97e719a8
      Authored by Eric Dumazet
      A cpu can observe sd->defer_count reaching 128,
      and call smp_call_function_single_async().
      
      The problem is that the remote CPU can clear sd->defer_count
      before the IPI is run/acknowledged.
      
      Other cpus can queue more packets and also decide
      to call smp_call_function_single_async() while the pending
      IPI has not yet been delivered.
      
      This is a common issue with smp_call_function_single_async():
      callers must ensure correct synchronization and serialization.
      
      I triggered this issue while experimenting with a smaller threshold.
      Performing the call to smp_call_function_single_async()
      under sd->defer_lock protection did not solve the problem.
      
      Commit 5a18ceca ("smp: Allow smp_call_function_single_async()
      to insert locked csd") replaced an informative WARN_ON_ONCE()
      with a return of -EBUSY, which is often ignored.
      Testing for CSD_FLAG_LOCK presence is racy anyway.
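      
      The fix guards the IPI with a dedicated flag, roughly like this (a
      sketch of the approach; the defer_ipi_scheduled field and the
      trigger_rx_softirq() handler follow net/core/dev.c of this series and
      may differ in detail):
      
              /* sender side: only one cpu wins the right to send the IPI */
              if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
                      smp_call_function_single_async(cpu, &sd->defer_csd);
      
              /* IPI handler on the remote cpu re-arms the flag */
              static void trigger_rx_softirq(void *data)
              {
                      struct softnet_data *sd = data;
      
                      __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                      smp_store_release(&sd->defer_ipi_scheduled, 0);
              }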
      
      Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97e719a8
    • M
      net: skb: check the boundary of drop reason in kfree_skb_reason() · 20bbcd0a
      Authored by Menglong Dong
      Sometimes we may forget to reset the skb drop reason to NOT_SPECIFIED
      after making it the return value of a function with return type enum
      skb_drop_reason, such as tcp_inbound_md5_hash. Its value can then be
      SKB_NOT_DROPPED_YET (0), which is invalid for kfree_skb_reason().
      
      So we check the range of the drop reason in kfree_skb_reason() with
      DEBUG_NET_WARN_ON_ONCE().
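      
      Concretely, the check amounts to something like (a sketch, assuming the
      guard sits at the top of kfree_skb_reason()):
      
      void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
      {
              /* catch SKB_NOT_DROPPED_YET (0) and out-of-range values early */
              DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);
      
              if (!skb_unref(skb))
                      return;
              /* ... rest of the freeing path ... */
      }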
      Reviewed-by: Jiang Biao <benbjiang@tencent.com>
      Reviewed-by: Hao Peng <flyingpeng@tencent.com>
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20bbcd0a
  13. 09 May 2022 (1 commit)
    • L
      net: fix wrong network header length · cf3ab8d4
      Authored by Lina Wang
      When clatd starts with ebpf offloading and NETIF_F_GRO_FRAGLIST is enabled,
      several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's
      ipv6 header is changed to ipv4 by bpf_skb_proto_6_to_4, and its
      network_header/transport_header/mac_header are updated accordingly,
      but the other skbs in frag_list are not updated at all and remain ipv6 packets.
      
      udp_queue_rcv_skb calls skb_segment_list to traverse the other skbs in
      frag_list and make sure the right udp payload is delivered to user space.
      Unfortunately, the skbs in frag_list, which are still ipv6 packets, get
      their headers updated like the first skb and end up with a wrong
      transport header length.
      
      E.g. before bpf_skb_proto_6_to_4, the first skb and the other skbs in
      frag_list have the same network_header (24) and transport_header (64).
      After bpf_skb_proto_6_to_4, the ipv6 protocol has been changed to ipv4:
      the first skb's network_header is 44 and transport_header is 64, while
      the other skbs in frag_list are unchanged. After skb_segment_list, the
      other skbs in frag_list have a different network_header (24) and
      transport_header (44), a 20-byte difference from the original, which is
      the size difference between an ipv6 and an ipv4 header. Just change
      transport_header back to match the original.
      
      Actually, there are two possible fixes: one is traversing all skbs and
      changing every skb header in bpf_skb_proto_6_to_4; the other is modifying
      the frag_list skbs' headers in skb_segment_list. Considering efficiency,
      adopt the second one, as sketched below: when the first skb and the other
      skbs in frag_list have different network_header lengths, restore them to
      make sure the right udp payload is delivered to user space.
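      
      A minimal sketch of that adjustment in skb_segment_list(), assuming the
      surrounding code matches net/core/skbuff.c of this era:
      
              /* remember how much the network header lengths differ ... */
              len_diff = skb_network_header_len(nskb) - skb_network_header_len(skb);
              __copy_skb_header(nskb, skb);
      
              skb_headers_offset_update(nskb, skb_headroom(nskb) - skb_headroom(skb));
              /* ... and shift the copied transport_header back by that delta */
              nskb->transport_header += len_diff;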
      Signed-off-by: Lina Wang <lina.wang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf3ab8d4
  14. 06 May 2022 (1 commit)
  15. 03 May 2022 (1 commit)
  16. 30 Apr 2022 (1 commit)
  17. 27 Apr 2022 (1 commit)
    • E
      net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Authored by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick the skb up before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks up the skbs, they are still
      processed on the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) case where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
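      
      The deferral entry point looks roughly like this (an abbreviated sketch
      of skb_attempt_defer_free(); details such as the offline-cpu check and
      the exact kick threshold follow net/core/skbuff.c at the time):
      
      void skb_attempt_defer_free(struct sk_buff *skb)
      {
              int cpu = skb->alloc_cpu;
              struct softnet_data *sd;
              unsigned long flags;
              bool kick;
      
              /* free locally if we already are on the allocating cpu */
              if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                      __kfree_skb(skb);
                      return;
              }
      
              sd = &per_cpu(softnet_data, cpu);
              spin_lock_irqsave(&sd->defer_lock, flags);
              skb->next = sd->defer_list;
              /* paired with READ_ONCE() in the remote cpu's flush */
              WRITE_ONCE(sd->defer_list, skb);
              kick = ++sd->defer_count == 128;
              spin_unlock_irqrestore(&sd->defer_lock, flags);
      
              /* slow path: remote cpu is not draining fast enough */
              if (unlikely(kick))
                      smp_call_function_single_async(cpu, &sd->defer_csd);
      }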
      
      Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the
      page recycling strategy used by the NIC driver (its page pool capacity
      being too small compared to the number of skbs/pages held in socket
      receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      68822bdf
  18. 21 Apr 2022 (1 commit)
  19. 01 Apr 2022 (1 commit)
    • J
      skbuff: fix coalescing for page_pool fragment recycling · 1effe8ca
      Authored by Jean-Philippe Brucker
      Fix a use-after-free when using page_pool with page fragments. We
      encountered this problem during normal RX in the hns3 driver:
      
      (1) Initially we have three descriptors in the RX queue. The first one
          allocates PAGE1 through page_pool, and the other two allocate one
          half of PAGE2 each. Page references look like this:
      
                      RX_BD1 _______ PAGE1
                      RX_BD2 _______ PAGE2
                      RX_BD3 _________/
      
      (2) Handle RX on the first descriptor. Allocate SKB1, eventually added
          to the receive queue by tcp_queue_rcv().
      
      (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
          netif_receive_skb():
      
          netif_receive_skb(SKB2)
            ip_rcv(SKB2)
              SKB3 = skb_clone(SKB2)
      
          SKB2 and SKB3 share a reference to PAGE2 through
          skb_shinfo()->dataref. The other ref to PAGE2 is still held by
          RX_BD3:
      
                            SKB2 ---+- PAGE2
                            SKB3 __/   /
                      RX_BD3 _________/
      
       (3b) Now while handling TCP, coalesce SKB3 with SKB1:
      
            tcp_v4_rcv(SKB3)
              tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
              kfree_skb_partial(SKB3)
                skb_release_data(SKB3)                // drops one dataref
      
                            SKB1 _____ PAGE1
                                 \____
                            SKB2 _____ PAGE2
                                       /
                      RX_BD3 _________/
      
          In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
          PAGE2, where it should instead have increased the page_pool frag
          reference, pp_frag_count. Without coalescing, when releasing both
          SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
          when releasing SKB1 and SKB2, two references to PAGE2 will be
          dropped, resulting in underflow.
      
       (3c) Drop SKB2:
      
            af_packet_rcv(SKB2)
              consume_skb(SKB2)
                skb_release_data(SKB2)                // drops second dataref
                  page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count
      
                            SKB1 _____ PAGE1
                                 \____
                                       PAGE2
                                       /
                      RX_BD3 _________/
      
      (4) Userspace calls recvmsg()
          Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
          release the SKB3 page as well:
      
          tcp_eat_recv_skb(SKB1)
            skb_release_data(SKB1)
              page_pool_return_skb_page(PAGE1)
              page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count
      
      (5) PAGE2 is freed, but the third RX descriptor was still using it!
          In our case this causes IOMMU faults, but it would silently corrupt
          memory if the IOMMU was disabled.
      
      Change the logic that checks whether pp_recycle SKBs can be coalesced.
      We still reject differing pp_recycle between 'from' and 'to' SKBs, but
      in order to avoid the situation described above, we also reject
      coalescing when both 'from' and 'to' are pp_recycled and 'from' is
      cloned.
      
      The new logic allows coalescing a cloned pp_recycle SKB into a page
      refcounted one, because in this case the release (4) will drop the right
      reference, the one taken by skb_try_coalesce().
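      
      In skb_try_coalesce(), the check then becomes roughly (a sketch of the
      revised condition):
      
              /* reject if recycling state differs, or if a pp_recycle 'from'
               * is cloned (its pages may still be shared via dataref)
               */
              if (to->pp_recycle != from->pp_recycle ||
                  (from->pp_recycle && skb_cloned(from)))
                      return false;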
      
      Fixes: 53e0961d ("page_pool: add frag page recycling support in page pool")
      Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1effe8ca
  20. 04 Mar 2022 (1 commit)
  21. 03 Mar 2022 (3 commits)
    • M
      net: Clear mono_delivery_time bit in __skb_tstamp_tx() · d93376f5
      Authored by Martin KaFai Lau
      In __skb_tstamp_tx(), it may clone the egress skb and queue the clone to
      the sk_error_queue.  The outgoing skb may carry the mono delivery_time
      while the (rcv) timestamp is expected for the clone, so the
      skb->mono_delivery_time bit needs to be cleared from the clone.
      
      This patch adds the skb->mono_delivery_time clearing to the existing
      __net_timestamp() and uses it in __skb_tstamp_tx().
      The __net_timestamp() fast path usage in dev.c is changed to directly
      call ktime_get_real(), since the mono_delivery_time bit is not set at
      that point.
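      
      After the change, __net_timestamp() looks roughly like this (a sketch
      following the description above):
      
      static inline void __net_timestamp(struct sk_buff *skb)
      {
              skb->tstamp = ktime_get_real();
              skb->mono_delivery_time = 0;    /* the clone carries a (rcv) timestamp */
      }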
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d93376f5
    • M
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Authored by Martin KaFai Lau
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      performance when the skb finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag a skb->tstamp carrying the mono delivery_time.
      
      This patch adds the skb_clear_tstamp() helper, which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  That will be done in a later patch.
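      
      A minimal sketch of the helper, following the description (the real
      definition lives in include/linux/skbuff.h):
      
      static inline void skb_clear_tstamp(struct sk_buff *skb)
      {
              if (skb->mono_delivery_time)
                      return;         /* keep the mono delivery_time */
      
              skb->tstamp = 0;
      }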
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de799101
    • L
      net: fix up skbs delta_truesize in UDP GRO frag_list · 224102de
      Authored by lena wang
      The truesize for a UDP GRO packet is accumulated from the main skb and
      the skbs in the main skb's frag_list:
      skb_gro_receive_list
              p->truesize += skb->truesize;
      
      Commit 53475c5d ("net: fix use-after-free when UDP GRO with
      shared fraglist") introduced a truesize increase for frag_list skbs.
      When uncloning an skb, it calls pskb_expand_head, and the truesize of
      frag_list skbs may increase. This can occur when the allocator uses
      __netdev_alloc_skb and does not go through __alloc_skb: that flow does
      not use ksize(len) to calculate truesize, while pskb_expand_head does.
      skb_segment_list
      err = skb_unclone(nskb, GFP_ATOMIC);
      pskb_expand_head
              if (!skb->sk || skb->destructor == sock_edemux)
                      skb->truesize += size - osize;
      
      If we use the increased truesize as the delta_truesize, it will be
      larger than before, and with abundant skbs in the frag_list it can even
      exceed the previous total truesize. The main skb's truesize then becomes
      smaller, possibly going negative, which wraps to a huge value in an
      unsigned int parameter. The following memory check will then drop this
      abnormal skb.
      
      To avoid this error, we should use the original truesize to segment the
      main skb, as sketched below.
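      
      In skb_segment_list(), that means accumulating truesize before the
      unclone, roughly (a sketch of the approach, not the verbatim diff):
      
              while (list_skb) {
                      nskb = list_skb;
                      list_skb = list_skb->next;
      
                      /* record truesize before skb_unclone() can inflate it */
                      delta_truesize += nskb->truesize;
                      err = skb_unclone(nskb, GFP_ATOMIC);
                      /* ... */
              }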
      
      Fixes: 53475c5d ("net: fix use-after-free when UDP GRO with shared fraglist")
      Signed-off-by: lena wang <lena.wang@mediatek.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      224102de
  22. 23 Feb 2022 (3 commits)
    • E
      net: preserve skb_end_offset() in skb_unclone_keeptruesize() · 2b88cba5
      Authored by Eric Dumazet
      syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len)
      in skb_try_coalesce() [1]
      
      I was able to root cause the issue to kfence.
      
      When kfence is in action, the following assertion is no longer true:
      
      int size = xxxx;
      void *ptr1 = kmalloc(size, gfp);
      void *ptr2 = kmalloc(size, gfp);
      
      if (ptr1 && ptr2)
      	ASSERT(ksize(ptr1) == ksize(ptr2));
      
      We attempted to fix these issues in the blamed commits, but forgot
      that TCP was possibly shifting data after skb_unclone_keeptruesize()
      has been used, notably from tcp_retrans_try_collapse().
      
      So we not only need to keep the same skb->truesize value,
      we also need to make sure TCP won't fill the new tailroom
      that pskb_expand_head() was able to get from an
      addr = kmalloc(...) followed by ksize(addr).
      
      Split skb_unclone_keeptruesize() into two parts:
      
      1) Inline skb_unclone_keeptruesize() for the common case,
         when skb is not cloned.
      
      2) Out of line __skb_unclone_keeptruesize() for the 'slow path'.
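      
      The inline fast path then reduces to something like this (a sketch
      following the split described above):
      
      static inline int skb_unclone_keeptruesize(struct sk_buff *skb, gfp_t pri)
      {
              might_sleep_if(gfpflags_allow_blocking(pri));
      
              if (skb_cloned(skb))
                      return __skb_unclone_keeptruesize(skb, pri); /* slow path */
              return 0;
      }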
      
      WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Modules linked in:
      CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa
      RSP: 0018:ffffc900063af268 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000
      RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003
      RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000
      R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8
      R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155
      FS:  00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline]
       tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630
       tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914
       tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025
       tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947
       tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719
       sk_backlog_rcv include/net/sock.h:1037 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2779
       release_sock+0x54/0x1b0 net/core/sock.c:3311
       sk_wait_data+0x177/0x450 net/core/sock.c:2821
       tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457
       tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572
       inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c4777efa ("net: add and use skb_unclone_keeptruesize() helper")
      Fixes: 097b9146 ("net: fix up truesize of cloned skb in skb_prepare_for_shift()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b88cba5
    • E
      net: add skb_set_end_offset() helper · 763087da
      Authored by Eric Dumazet
      We have multiple places where this helper is convenient,
      and we plan to use it in the following patch.
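      
      A sketch of the helper, covering both ways skb->end is stored (the real
      definition is in include/linux/skbuff.h):
      
      #ifdef NET_SKBUFF_DATA_USES_OFFSET
      static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
      {
              skb->end = offset;              /* skb->end is an offset from skb->head */
      }
      #else
      static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
      {
              skb->end = skb->head + offset;  /* skb->end is a pointer */
      }
      #endif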
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      763087da
    • E
      net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends · ef527f96
      Authored by Eric Dumazet
      Whenever one of these functions pulls all data from an skb in a frag_list,
      use consume_skb() instead of kfree_skb() to avoid polluting drop
      monitoring.
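      
      The change is essentially of this shape (illustrative; 'list' stands
      for the frag_list skb being released in those helpers):
      
              /* the skb's data was fully consumed, so this is not a drop */
              consume_skb(list);      /* was: kfree_skb(list) */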
      
      Fixes: 6fa01ccd ("skbuff: Add pskb_extract() helper function")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220220154052.1308469-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ef527f96
  23. 18 Feb 2022 (1 commit)
    • E
      net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Authored by Eric Dumazet
      UDP sendmsg() can be lockless, which is causing all kinds
      of data races.
      
      This patch converts sk->sk_tskey to atomic_t to remove one of these races.
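      
      The conversion is of this shape (a sketch; the increment site shown is
      __ip_append_data(), per the KCSAN report below):
      
              /* struct sock: was 'u32 sk_tskey;' */
              atomic_t                sk_tskey;
      
              /* __ip_append_data(): was 'tskey = sk->sk_tskey++;' */
              tskey = atomic_inc_return(&sk->sk_tskey) - 1;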
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1cdec57
  24. 10 Feb 2022 (1 commit)
  25. 10 Jan 2022 (1 commit)
    • M
      net: skb: introduce kfree_skb_reason() · c504e5c2
      Authored by Menglong Dong
      Introduce the interface kfree_skb_reason(), which is able to pass
      the reason why the skb is dropped to the 'kfree_skb' tracepoint.
      
      Add the 'reason' field to 'trace_kfree_skb', so that users can get
      more detailed information about abnormal skbs with 'drop_monitor' or
      eBPF.
      
      All drop reasons are defined in the enum 'skb_drop_reason', and
      they will be printed as strings in the 'kfree_skb' tracepoint in the
      format 'reason: XXX'.
      
      ( Maybe the reasons should be defined in a uapi header file, so that
      user space can use them? )
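      
      With this, kfree_skb() itself becomes a thin wrapper, roughly (a sketch
      consistent with the description above):
      
      void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);
      
      static inline void kfree_skb(struct sk_buff *skb)
      {
              kfree_skb_reason(skb, SKB_DROP_REASON_NOT_SPECIFIED);
      }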
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c504e5c2
  26. 16 Dec 2021 (1 commit)
  27. 08 Dec 2021 (1 commit)
  28. 22 Nov 2021 (1 commit)
  29. 16 Nov 2021 (1 commit)