1. 05 6月, 2017 1 次提交
    • J
      skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow · 48a1df65
      Jason A. Donenfeld 提交于
      This is a defense-in-depth measure in response to bugs like
      4d6fa57b ("macsec: avoid heap overflow in skb_to_sgvec"). There's
      not only a potential overflow of sglist items, but also a stack overflow
      potential, so we fix this by limiting the amount of recursion this function
      is allowed to do. Not actually providing a bounded base case is a future
      disaster that we can easily avoid here.
      
      As a small matter of house keeping, we take this opportunity to move the
      documentation comment over the actual function the documentation is for.
      
      While this could be implemented by using an explicit stack of skbuffs,
      when implementing this, the function complexity increased considerably,
      and I don't think such complexity and bloat is actually worth it. So,
      instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
      and measured the stack usage there. I also reverted the recent MIPS
      changes that give it a separate IRQ stack, so that I could experience
      some worst-case situations. I found that limiting it to 24 layers deep
      yielded a good stack usage with room for safety, as well as being much
      deeper than any driver actually ever creates.
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48a1df65
  2. 22 5月, 2017 1 次提交
    • M
      net: allow simultaneous SW and HW transmit timestamping · b50a5c70
      Miroslav Lichvar 提交于
      Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to
      be looped to the socket's error queue with a software timestamp even
      when a hardware transmit timestamp is expected to be provided by the
      driver.
      
      Applications using this option will receive two separate messages from
      the error queue, one with a software timestamp and the other with a
      hardware timestamp. As the hardware timestamp is saved to the shared skb
      info, which may happen before the first message with software timestamp
      is received by the application, the hardware timestamp is copied to the
      SCM_TIMESTAMPING control message only when the skb has no software
      timestamp or it is an incoming packet.
      
      While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as
      there are no other users.
      
      CC: Richard Cochran <richardcochran@gmail.com>
      CC: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NMiroslav Lichvar <mlichvar@redhat.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b50a5c70
  3. 20 5月, 2017 1 次提交
  4. 29 4月, 2017 1 次提交
  5. 22 4月, 2017 1 次提交
    • I
      gso: Validate assumption of frag_list segementation · 43170c4e
      Ilan Tayari 提交于
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it not true.
      
      Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43170c4e
  6. 18 4月, 2017 2 次提交
    • I
      gso: Validate assumption of frag_list segementation · 7a7a9bd7
      Ilan Tayari 提交于
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it not true.
      
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a7a9bd7
    • W
      net-timestamp: avoid use-after-free in ip_recv_error · 1862d620
      Willem de Bruijn 提交于
      Syzkaller reported a use-after-free in ip_recv_error at line
      
          info->ipi_ifindex = skb->dev->ifindex;
      
      This function is called on dequeue from the error queue, at which
      point the device pointer may no longer be valid.
      
      Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
      pointer is valid or NULL. Store it in temporary storage skb->cb.
      
      It is safe to reference skb->dev here, as called from device drivers
      or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
      in that case it is NULL and ifindex is set to 0 (invalid).
      
      Do not return a pktinfo cmsg if ifindex is 0. This maintains the
      current behavior of not returning a cmsg if skb->dev was NULL.
      
      On dequeue, the ipv4 path will cast from sock_exterr_skb to
      in_pktinfo. Both have ifindex as their first element, so no explicit
      conversion is needed. This is by design, introduced in commit
      0b922b7a ("net: original ingress device index in PKTINFO"). For
      ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
      
      Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1862d620
  7. 13 4月, 2017 1 次提交
    • I
      gso: Support frag_list splitting with head_frag · eaffadbb
      Ilan Tayari 提交于
      A driver may use build_skb() for received packets.
      These SKBs then have a head_frag.
      
      Since commit d7e8883c ("net: make GRO aware of
      skb->head_frag"), GRO may build frag_list SKBs out of
      head_frag received SKBs.
      In such a case, the chained SKBs end up with a head_frag.
      
      Commit 07b26c94 ("gso: Support partial splitting at
      the frag_list pointer") adds partial segmentation of frag_list
      SKB chains into individual SKBs.
      However, this is not done if the chained SKBs have any
      linear part, because the device may not be able to DMA
      the private linear buffer.
      
      A chained frag_list SKB with head_frag is wrongfully
      detected in this case as having a private linear part
      and thus falls back to software GSO, while in fact the
      linear part is backed by a DMA page just like any other frag.
      
      This causes low performance when forwarding those packets
      that were built with build_skb()
      
      Allow partial segmentation at the frag_list pointer for
      chained SKBs with head_frag.
      
      Note that such SKBs can only be created by GRO, when applied
      to received packets with head_frag.
      Also note that this change only affects the data path that
      performs the partial segmentation at frag_list pointer, and
      not any of the other more common data paths.
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaffadbb
  8. 22 3月, 2017 2 次提交
  9. 08 3月, 2017 2 次提交
  10. 03 2月, 2017 1 次提交
  11. 02 2月, 2017 1 次提交
  12. 28 1月, 2017 1 次提交
    • E
      net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Eric Dumazet 提交于
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, issue came from IPv6 reassembly coping with malicious
      datagrams, that forced various pskb_may_pull() to reallocate a bigger
      skb->head than the one allocated by NIC driver before entering GRO
      layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not
      easy, as some producers might track skb->truesize, for example
      in xmit path for back pressure feedback (sk->sk_wmem_alloc)
      
      We can detect the cases where it should be safe to change
      skb->truesize :
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, destructor is sock_edemux()
      
      My audit gave only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove skb parameter in sock_edemux macro when
      CONFIG_INET is not set to avoid a compile error.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSlava Shwartsman <slavash@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      158f323b
  13. 11 1月, 2017 1 次提交
  14. 09 1月, 2017 1 次提交
  15. 27 12月, 2016 1 次提交
  16. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  17. 25 12月, 2016 1 次提交
  18. 10 12月, 2016 1 次提交
  19. 09 12月, 2016 1 次提交
    • E
      udp: under rx pressure, try to condense skbs · c8c8b127
      Eric Dumazet 提交于
      Under UDP flood, many softirq producers try to add packets to
      UDP receive queue, and one user thread is burning one cpu trying
      to dequeue packets as fast as possible.
      
      Two parts of the per packet cost are :
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If socket is under pressure, softirq handler(s) can try to pull in
      skb->head the payload of the packet if it fits.
      
      Meaning the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do this hundreds of usec
      later, possibly from another node.
      
      Additional gains :
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded to available tail room, which
      is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
      than necessary)
      
      This patch gave me about 5 % increase in throughput in my tests.
      
      skb_condense() helper could probably used in other contexts.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8c8b127
  20. 02 12月, 2016 1 次提交
  21. 30 11月, 2016 1 次提交
    • F
      tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING · 1c885808
      Francis Yan 提交于
      This patch exports the sender chronograph stats via the socket
      SO_TIMESTAMPING channel. Currently we can instrument how long a
      particular application unit of data was queued in TCP by tracking
      SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
      these sender chronograph stats exported simultaneously along with
      these timestamps allow further breaking down the various sender
      limitation.  For example, a video server can tell if a particular
      chunk of video on a connection takes a long time to deliver because
      TCP was experiencing small receive window. It is not possible to
      tell before this patch without packet traces.
      
      To prepare these stats, the user needs to set
      SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
      while requesting other SOF_TIMESTAMPING TX timestamps. When the
      timestamps are available in the error queue, the stats are returned
      in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
      in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
      TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.
      Signed-off-by: NFrancis Yan <francisyyan@gmail.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c885808
  22. 25 11月, 2016 1 次提交
    • E
      tcp: enhance tcp_collapse_retrans() with skb_shift() · f8071cde
      Eric Dumazet 提交于
      In commit 2331ccc5 ("tcp: enhance tcp collapsing"),
      we made a first step allowing copying right skb to left skb head.
      
      Since all skbs in socket write queue are headless (but possibly the very
      first one), this strategy often does not work.
      
      This patch extends tcp_collapse_retrans() to perform frag shifting,
      thanks to skb_shift() helper.
      
      This helper needs to not BUG on non headless skbs, as callers are ok
      with that.
      
      Tested:
      
      Following packetdrill test now passes :
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.099 < . 1:1(0) ack 201 win 257
      +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401>
         +0 > P. 201:1001(800) ack 1
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8071cde
  23. 20 11月, 2016 1 次提交
  24. 08 11月, 2016 1 次提交
    • S
      sock: do not set sk_err in sock_dequeue_err_skb · f5f99309
      Soheil Hassas Yeganeh 提交于
      Do not set sk_err when dequeuing errors from the error queue.
      Doing so results in:
      a) Bugs: By overwriting existing sk_err values, it possibly
         hides legitimate errors. It is also incorrect when local
         errors are queued with ip_local_error. That happens in the
         context of a system call, which already returns the error
         code.
      b) Inconsistent behavior: When there are pending errors on
         the error queue, sk_err is sometimes 0 (e.g., for
         the first timestamp on the error queue) and sometimes
         set to an error code (after dequeuing the first
         timestamp).
      c) Suboptimality: Setting sk_err to ENOMSG on simple
         TX timestamps can abort parallel reads and writes.
      
      Removing this line doesn't break userspace. This is because
      userspace code cannot rely on sk_err for detecting whether
      there is something on the error queue. Except for ICMP messages
      received for UDP and RAW, sk_err is not set at enqueue time,
      and as a result sk_err can be 0 while there are plenty of
      errors on the error queue.
      
      For ICMP packets in UDP and RAW, sk_err is set when they are
      enqueued on the error queue, but that does not result in aborting
      reads and writes. For such cases, sk_err is only readable via
      getsockopt(SO_ERROR) which will reset the value of sk_err on
      its own. More importantly, prior to this patch,
      recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
      sk_err is set by sock_dequeue_err_skb without atomic ops or
      locks) which can store 0 in sk_err even when we have ICMP
      messages pending. Removing this line from sock_dequeue_err_skb
      eliminates that race.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5f99309
  25. 04 10月, 2016 2 次提交
    • S
      net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header · b6a79208
      Shmulik Ladkani 提交于
      skb_vlan_pop/push were too generic, trying to support the cases where
      skb->data is at mac header, and cases where skb->data is arbitrarily
      elsewhere.
      
      Supporting an arbitrary skb->data was complex and bogus:
       - It failed to unwind skb->data to its original location post actual
         pop/push.
         (Also, semantic is not well defined for unwinding: If data was into
          the eth header, need to use same offset from start; But if data was
          at network header or beyond, need to adjust the original offset
          according to the push/pull)
       - It mangled the rcsum post actual push/pop, without taking into account
         that the eth bytes might already have been pulled out of the csum.
      
      Most callers (ovs, bpf) already had their skb->data at mac_header upon
      invoking skb_vlan_pop/push.
      Last caller that failed to do so (act_vlan) has been recently fixed.
      
      Therefore, to simplify things, no longer support arbitrary skb->data
      inputs for skb_vlan_pop/push().
      
      skb->data is expected to be exactly at mac_header; WARN otherwise.
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Pravin Shelar <pshelar@ovn.org>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6a79208
    • A
      skb_splice_bits(): get rid of callback · 25869262
      Al Viro 提交于
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      25869262
  26. 22 9月, 2016 3 次提交
  27. 20 9月, 2016 1 次提交
  28. 09 9月, 2016 1 次提交
    • Y
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang 提交于
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering to reach the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: NYaogong Wang <wygivan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f5afeae
  29. 02 7月, 2016 1 次提交
  30. 04 6月, 2016 5 次提交