1. 02 Mar 2015, 1 commit
    • net: move skb->dropcount to skb->cb[] · 744d5a3e
      Authored by Eyal Birger
      Commit 97775007 ("af_packet: add interframe drop cmsg (v6)")
      unionized skb->mark and skb->dropcount in order to allow recording
      of the socket drop count while maintaining struct sk_buff size.
      
      skb->dropcount was introduced since there was no available room
      in skb->cb[] in packet sockets. However, its introduction led to
      the inability to export skb->mark, or any other aliased field to
      userspace if so desired.
      
      Moving the dropcount metric to skb->cb[] eliminates this problem
      at the expense of 4 bytes less in skb->cb[] for protocol families
      using it.
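
      A minimal sketch of the resulting layout, following the mainline commit
      (names match the patch, but treat the excerpt as illustrative):

          /* Per-socket metadata now lives at the start of skb->cb[],
           * so skb->mark no longer has to alias the drop count. */
          struct sock_skb_cb {
                  u32 dropcount;
          };

          #define SOCK_SKB_CB(__skb) ((struct sock_skb_cb *)((__skb)->cb))

          static inline void sock_skb_set_dropcount(const struct sock *sk,
                                                    struct sk_buff *skb)
          {
                  SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
          }
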
      Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      744d5a3e
  2. 23 Feb 2015, 1 commit
  3. 12 Feb 2015, 2 commits
    • net: Infrastructure for CHECKSUM_PARTIAL with remote checksum offload · 15e2396d
      Authored by Tom Herbert
      This patch adds infrastructure so that remote checksum offload can
      set CHECKSUM_PARTIAL instead of calling csum_partial and writing
      the modified checksum field.
      
      Add a skb_remcsum_adjust_partial function to set up an skb for using
      CHECKSUM_PARTIAL with remote checksum offload.  Change
      skb_remcsum_process and skb_gro_remcsum_process to take a boolean
      argument indicating whether CHECKSUM_PARTIAL can be set or the
      checksum needs to be modified using the normal algorithm.
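
      The partial-checksum setup is essentially the following (a sketch along
      the lines of the mainline helper):

          /* Point csum_start/csum_offset at the remote-offloaded checksum
           * field so the stack (or hardware) completes it later, instead
           * of computing it here with csum_partial(). */
          static inline void skb_remcsum_adjust_partial(struct sk_buff *skb,
                                                        void *ptr, int start,
                                                        int offset)
          {
                  skb->ip_summed = CHECKSUM_PARTIAL;
                  skb->csum_start = ((unsigned char *)ptr + start) - skb->head;
                  skb->csum_offset = offset - start;
          }
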
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15e2396d
    • net: Clarify meaning of CHECKSUM_PARTIAL for receive path · 6edec0e6
      Authored by Tom Herbert
      The current meaning of CHECKSUM_PARTIAL for validating checksums
      is that _all_ checksums in the packet are considered valid.
      However, given the manner in which CHECKSUM_PARTIAL is set, only the
      checksum at csum_start+csum_offset and any preceding checksums may
      be considered valid.  If there are checksums in the packet after
      csum_offset, it is possible they have not been verified.
      
      This patch changes the CHECKSUM_PARTIAL logic in skb_csum_unnecessary and
      __skb_gro_checksum_validate_needed to consider only the checksum
      referring to csum_start, and any preceding checksums (with a starting
      offset before csum_start), to be verified.
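
      After this change the validation test looks roughly like this (a sketch
      of the mainline helper; the skb_checksum_start_offset() >= 0 test is what
      limits validity to checksums at or before csum_start):

          static inline int skb_csum_unnecessary(const struct sk_buff *skb)
          {
                  /* For CHECKSUM_PARTIAL, only trust the result when the
                   * checksum under consideration starts at or before
                   * csum_start, i.e. it is covered by the partial setup. */
                  return ((skb->ip_summed == CHECKSUM_UNNECESSARY) ||
                          skb->csum_valid ||
                          (skb->ip_summed == CHECKSUM_PARTIAL &&
                           skb_checksum_start_offset(skb) >= 0));
          }
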
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6edec0e6
  4. 09 Feb 2015, 1 commit
  5. 05 Feb 2015, 2 commits
    • net: add skb functions to process remote checksum offload · dcdc8994
      Authored by Tom Herbert
      This patch adds skb_remcsum_process and skb_gro_remcsum_process to
      perform the appropriate adjustments to the skb when receiving
      remote checksum offload.
      
      Updated vxlan and gue to use these functions.
      
      Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
      not see any change in performance.
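
      The receive-side adjustment looks roughly like this (a sketch of the
      mainline helper; remcsum_adjust() folds the checksum of the data from
      'start' into the field at 'offset' and returns the resulting delta):

          static inline void skb_remcsum_process(struct sk_buff *skb, void *ptr,
                                                 int start, int offset)
          {
                  __wsum delta;

                  /* Remote checksum offload relies on the full packet
                   * checksum provided by the NIC. */
                  BUG_ON(skb->ip_summed != CHECKSUM_COMPLETE);

                  delta = remcsum_adjust(ptr, skb->csum, start, offset);

                  /* The packet bytes changed, so patch skb->csum too. */
                  skb->csum = csum_add(skb->csum, delta);
          }
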
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcdc8994
    • xps: fix xps for stacked devices · 2bd82484
      Authored by Eric Dumazet
      A typical qdisc setup is the following:
      
      bond0 : bonding device, using HTB hierarchy
      eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
      
      XPS allows spreading packets across specific tx queues, based on the cpu
      doing the send.
      
      The problem is that dequeues from the bond0 qdisc can happen on random
      cpus, because qdisc_run() can dequeue a batch of packets.
      
      CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
      CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
      
      CPUB -> dequeue packet P1 from bond0
              enqueue packet on eth1/eth2
      CPUC -> dequeue packet P2 from bond0
              enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
      
      get_xps_queue() might then select the wrong queue for P1, since the
      current cpu might be different from CPUA.
      
      P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping)
      if CPUC runs a bit faster (or CPUB spins a bit on the qdisc lock).
      
      The effect of this bug is TCP reordering and, more generally, suboptimal
      TX queue placement.  (A victim bulk flow can be migrated to the wrong TX
      queue for a while.)
      
      To fix this, we have to record the sender cpu number the first time
      dev_queue_xmit() is called for a tx skb.
      
      We can union napi_id (used on the receive path) and sender_cpu, provided
      we clear sender_cpu in skb_scrub_packet() (credit to Willem for this
      union idea).
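
      The shape of the fix, as a hedged sketch (the helper names here are
      hypothetical; the actual patch open-codes this in netdev_pick_tx() and
      get_xps_queue()):

          /* Hypothetical helper: on the first dev_queue_xmit() for an skb,
           * record the queuing cpu; zero means "not recorded", hence +1. */
          static inline void skb_record_sender_cpu(struct sk_buff *skb)
          {
                  if (skb->sender_cpu == 0)
                          skb->sender_cpu = raw_smp_processor_id() + 1;
          }

          /* When a slave device later picks an XPS queue, use the recorded
           * cpu rather than whichever cpu ran the qdisc dequeue. */
          static inline int skb_sender_cpu(const struct sk_buff *skb)
          {
                  return skb->sender_cpu - 1;
          }
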
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bd82484
  6. 04 Feb 2015, 2 commits
  7. 11 Dec 2014, 2 commits
    • net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb · fd11a83d
      Authored by Alexander Duyck
      This change pulls the core functionality out of __netdev_alloc_skb and
      places it in a new function named __alloc_rx_skb.  The reason for doing
      this is to make these bits accessible to a new function __napi_alloc_skb.
      In addition, __alloc_rx_skb now takes a flags value that is used to
      determine which page-frag pool to allocate from.  If the SKB_ALLOC_NAPI
      flag is set then the NAPI pool is used.  The advantage of this is that we
      do not have to use local_irq_save/restore when accessing the NAPI pool
      from NAPI context.
      
      In my test setup I saw at least 11ns of savings using the napi_alloc_skb
      function versus the netdev_alloc_skb function, most of it due to not
      having to call local_irq_save/restore.
      
      The main use case for napi_alloc_skb would be for things such as copybreak
      or page fragment based receive paths where an skb is allocated after the
      data has been received instead of before.
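
      A hedged usage sketch (the driver function and copy pattern are
      illustrative, not from the patch):

          /* Illustrative copybreak path in a NAPI poll handler: the skb is
           * allocated only after the data arrived, from the NAPI page-frag
           * pool, with no local_irq_save/restore. */
          static struct sk_buff *rx_copybreak_sketch(struct napi_struct *napi,
                                                     const void *data,
                                                     unsigned int len)
          {
                  struct sk_buff *skb = napi_alloc_skb(napi, len);

                  if (!skb)
                          return NULL;

                  memcpy(skb_put(skb, len), data, len);
                  return skb;
          }
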
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd11a83d
    • net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag · ffde7328
      Authored by Alexander Duyck
      This patch splits the netdev_alloc_frag function up so that it can be
      used on one of two page-frag pools instead of being fixed on the
      netdev_alloc_cache.  By doing this we can add a NAPI-specific function,
      __napi_alloc_frag, that accesses a pool used only from softirq context.
      The advantage to this is that we do not need to call
      local_irq_save/restore, which can be a significant saving.
      
      I also took the opportunity to refactor the core bits that were placed in
      __alloc_page_frag.  First, I updated the allocation to do either a 32K
      allocation or an order-0 page.  This is based on the changes in commit
      d9b2938a, where it was found that latencies could be reduced in case of
      failures.  Then I rewrote the logic to work from the end of the page to
      the start; this way the size value doesn't have to be used unless we have
      run out of space for page fragments.  Finally, I cleaned up the atomic
      bits so that we just do an atomic_sub_and_test, and if that returns true
      we set page->_count via an atomic_set.  This removes the extra
      conditional around the atomic_read, since it would have led to an
      atomic_inc in the case of success anyway.
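
      Condensed, the refactored fast path looks roughly like this (a sketch
      with the refill path and failure handling elided):

          /* Carve fragments from the end of the page toward the start;
           * recharge page->_count in bulk only once the page is exhausted. */
          static void *alloc_page_frag_sketch(struct netdev_alloc_cache *nc,
                                              unsigned int fragsz)
          {
                  struct page *page = nc->frag.page;
                  int offset = nc->frag.offset - fragsz;

                  if (unlikely(offset < 0)) {
                          /* If dropping our remaining bias hits zero, we are
                           * the sole owner and may recycle the page with a
                           * single atomic_set; otherwise refill (elided). */
                          if (!atomic_sub_and_test(nc->pagecnt_bias,
                                                   &page->_count))
                                  return NULL;
                          atomic_set(&page->_count, nc->frag.size);
                          nc->pagecnt_bias = nc->frag.size;
                          offset = nc->frag.size - fragsz;
                  }

                  nc->pagecnt_bias--;
                  nc->frag.offset = offset;
                  return page_address(page) + offset;
          }
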
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ffde7328
  8. 10 Dec 2014, 5 commits
    • skb_copy_datagram_iovec() can die · d3a9632f
      Authored by Al Viro
      no callers other than itself.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d3a9632f
    • switch memcpy_to_msg() and skb_copy{,_and_csum}_datagram_msg() to primitives · e5a4b0bb
      Authored by Al Viro
      ... making both non-draining.  That means that tcp_recvmsg() becomes
      non-draining.  And _that_ would break iscsit_do_rx_data() unless we
      	a) make sure tcp_recvmsg() is uniformly non-draining (it is)
      	b) make sure it copes with arbitrary (including shifted)
      iov_iter (it does, all it uses is iov_iter primitives)
      	c) make iscsit_do_rx_data() initialize ->msg_iter only once.
      
      Fortunately, (c) is doable with minimal work, and we are rid of one of
      the two places where kernel send/recvmsg users would be unhappy with
      non-draining behaviour.
      
      Actually, that makes all but one of ->recvmsg() instances iov_iter-clean.
      The exception is skcipher_recvmsg() and it also isn't hard to convert
      to primitives (iov_iter_get_pages() is needed there).  That'll wait
      a bit - there's some interplay with ->sendmsg() path for that one.
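
      For (c), the shape of the fix is roughly the following (a hedged sketch;
      the iscsit structures are abbreviated and a plain iovec is used for
      simplicity):

          /* Initialize ->msg_iter once, before the receive loop.  Because
           * recvmsg() is now non-draining and merely advances msg_iter,
           * each call resumes exactly where the previous one stopped. */
          static int rx_data_sketch(struct socket *sock, struct iovec *iov,
                                    unsigned long iov_count, int data_len)
          {
                  struct msghdr msg = { };
                  int total_rx = 0;

                  iov_iter_init(&msg.msg_iter, READ, iov, iov_count, data_len);

                  while (total_rx < data_len) {
                          int rx = sock_recvmsg(sock, &msg,
                                                data_len - total_rx,
                                                MSG_WAITALL);
                          if (rx <= 0)
                                  break;
                          total_rx += rx;
                  }
                  return total_rx;
          }
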
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e5a4b0bb
    • put iov_iter into msghdr · c0371da6
      Authored by Al Viro
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
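
      The resulting kernel-side structure, roughly following the commit:

          struct msghdr {
                  void            *msg_name;      /* ptr to socket address structure */
                  int             msg_namelen;    /* size of socket address structure */
                  struct iov_iter msg_iter;       /* data; replaces msg_iov/msg_iovlen */
                  void            *msg_control;   /* ancillary data */
                  __kernel_size_t msg_controllen; /* ancillary data buffer length */
                  unsigned int    msg_flags;      /* flags on received message */
          };
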
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c0371da6
    • dst: no need to take reference on DST_NOCACHE dsts · dbfc4fb7
      Authored by Hannes Frederic Sowa
      Since commit f8864972 ("ipv4: fix dst race in sk_dst_get()")
      DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
      reference on them when we are in rcu protected sections.
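
      The pattern this enables, as a hedged sketch (the wrapper function is
      hypothetical; the actual patch adjusts skb_dst_set_noref() and its
      callers):

          /* Within an rcu_read_lock() section, DST_NOCACHE entries are now
           * RCU-freed like cached ones, so no reference is needed for uses
           * that do not outlive the RCU section. */
          static void dst_use_sketch(struct sk_buff *skb, struct dst_entry *dst)
          {
                  WARN_ON_ONCE(!rcu_read_lock_held());
                  skb_dst_set_noref(skb, dst);  /* no dst_hold(), even for DST_NOCACHE */
          }
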
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbfc4fb7
    • net: avoid two atomic operations in fast clones · 6ffe75eb
      Authored by Eric Dumazet
      Commit ce1a4ea3 ("net: avoid one atomic operation in skb_clone()")
      took the wrong approach to saving one atomic operation.
      
      It is actually possible to avoid two atomic operations, if we
      do not change skb->fclone values, and only rely on clone_ref
      content to signal if the clone is available or not.
      
      skb_clone() can simply use the fast clone if clone_ref is 1.
      
      kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.
      
      Note that because we usually free the clone before the original skb,
      this particular attempt is only done for the original skb to have better
      branch prediction.
      
      SKB_FCLONE_FREE is removed.
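
      The two fast paths then look roughly like this (fragments abridged from
      the mainline commit):

          /* skb_clone(): clone_ref == 1 means the companion clone is free,
           * so it can be handed out with a plain atomic_set instead of an
           * atomic read-modify-write: */
          if (skb->fclone == SKB_FCLONE_ORIG &&
              atomic_read(&fclones->fclone_ref) == 1) {
                  n = &fclones->skb2;
                  atomic_set(&fclones->fclone_ref, 2);
          }

          /* kfree_skbmem(), freeing the original skb: if clone_ref is
           * still 1 no clone is outstanding, so the atomic_dec_and_test
           * can be skipped entirely: */
          if (atomic_read(&fclones->fclone_ref) == 1)
                  goto fastpath;
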
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ffe75eb
  9. 09 Dec 2014, 1 commit
    • net: Add functions for handling padding frame and adding to length · 9c0c1124
      Authored by Alexander Duyck
      This patch adds two new helper functions skb_put_padto and eth_skb_pad.
      These functions deviate from the standard skb_pad or skb_padto in that they
      will also update the length and tail pointers so that they reflect the
      padding added to the frame.
      
      The eth_skb_pad helper is meant to be used with Ethernet devices to update
      either Rx or Tx frames so that they report the correct size.  The
      skb_put_padto helper is meant to be used primarily in the transmit path for
      network devices that need frames to be padded up to some minimum size and
      don't wish to simply update the length somewhere external to the frame.
      
      The motivation behind this is that there are a number of implementations
      throughout the network device drivers that are all doing the same thing,
      but each a little differently, and as a result several implementations
      contain bugs, such as updating the length without updating the tail
      offset, and other similar issues.
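
      The transmit-side helper, following the mainline implementation:

          /* Pad the frame to 'len' bytes and, unlike skb_padto(), also
           * advance skb->len and skb->tail to reflect the padding. */
          static inline int skb_put_padto(struct sk_buff *skb, unsigned int len)
          {
                  unsigned int size = skb->len;

                  if (unlikely(size < len)) {
                          len -= size;
                          if (unlikely(skb_pad(skb, len)))
                                  return -ENOMEM;
                          __skb_put(skb, len);
                  }
                  return 0;
          }

          /* eth_skb_pad() is then simply skb_put_padto(skb, ETH_ZLEN). */
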
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c0c1124
  10. 24 Nov 2014, 6 commits
  11. 22 Nov 2014, 2 commits
  12. 12 Nov 2014, 2 commits
    • net: Remove __skb_alloc_page and __skb_alloc_pages · 160d2aba
      Authored by Alexander Duyck
      Remove the two functions which are now dead code.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      160d2aba
    • net: Add device Rx page allocation function · 71dfda58
      Authored by Alexander Duyck
      This patch implements __dev_alloc_pages and __dev_alloc_page.  These are
      meant to replace the __skb_alloc_pages and __skb_alloc_page functions.
      The reason for doing this is that it occurred to me that __skb_alloc_page
      is supposed to be passed an sk_buff pointer, but it is NULL in all cases
      where it is used.  Worse, in the case of ixgbe it is passed NULL via the
      sk_buff pointer in the rx_buffer info structure, which means the compiler
      is not correctly stripping it out.
      
      The naming for these functions is based on dev_alloc_skb and
      __dev_alloc_skb.  There was originally a netdev_alloc_page; however, that
      was passed a net_device pointer and this function is not, so I thought it
      best to follow that naming scheme, since that is the same difference as
      between dev_alloc_skb and netdev_alloc_skb.
      
      In the case of anything greater than order 0 it is assumed that we want a
      compound page so __GFP_COMP is set for all allocations as we expect a
      compound page when assigning a page frag.
      
      The other change in this patch is to exploit the behaviors of the page
      allocator in how it handles flags.  So for example we can always set
      __GFP_COMP and __GFP_MEMALLOC since they are ignored if they are not
      applicable or are overridden by another flag.
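
      The resulting helpers, roughly (a sketch following the mainline commit):

          static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
                                                       unsigned int order)
          {
                  /* These flags are ignored or overridden by the page
                   * allocator when not applicable, so set them always:
                   * a cold, compound, possibly-MEMALLOC Rx page. */
                  gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

                  return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
          }

          static inline struct page *__dev_alloc_page(gfp_t gfp_mask)
          {
                  return __dev_alloc_pages(gfp_mask, 0);
          }
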
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71dfda58
  13. 08 Nov 2014, 2 commits
  14. 06 Nov 2014, 3 commits
  15. 04 Nov 2014, 1 commit
  16. 31 Oct 2014, 1 commit
  17. 29 Oct 2014, 1 commit
  18. 15 Oct 2014, 1 commit
  19. 09 Oct 2014, 1 commit
  20. 05 Oct 2014, 1 commit
    • net: Cleanup skb cloning by adding SKB_FCLONE_FREE · c8753d55
      Authored by Vijay Subramanian
      SKB_FCLONE_UNAVAILABLE has an overloaded meaning depending on the type of skb:
      1: If the skb is allocated from head_cache, it indicates that a fclone is
      not available.
      2: If the skb is a companion fclone skb (allocated from fclone_cache), it
      indicates that it is available to be used.
      
      To avoid confusion in case 2 above, this patch replaces
      SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate.  For fclone
      companion skbs, this indicates the skb is free for use.
      
      SKB_FCLONE_UNAVAILABLE now simply indicates that the skb is from head_cache
      and cannot / will not have a companion fclone.
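
      The resulting states, per the mainline commit:

          enum {
                  SKB_FCLONE_UNAVAILABLE, /* skb is from head_cache: no companion */
                  SKB_FCLONE_ORIG,        /* orig skb (from fclone_cache) */
                  SKB_FCLONE_CLONE,       /* companion fclone skb, in use */
                  SKB_FCLONE_FREE,        /* companion fclone skb, free for use */
          };
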
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8753d55
  21. 02 Oct 2014, 2 commits
    • udp: Generalize skb_udp_segment · 8bce6d7d
      Authored by Tom Herbert
      skb_udp_segment is the function called from udp4_ufo_fragment to
      segment a UDP tunnel packet. This function currently assumes
      segmentation is transparent Ethernet bridging (i.e. VXLAN
      encapsulation). This patch generalizes the function to
      operate on either Ethertype or IP protocol.
      
      The inner_protocol field must be set to the protocol of the inner
      header.  This can now be either an Ethertype or an IP protocol
      (in a union).  A new flag in the skbuff indicates which type is in
      effect.  skb_set_inner_protocol and skb_set_inner_ipproto helper
      functions were added to set the inner_protocol; these functions are
      called from the point where the tunnel encapsulation occurs.
      
      When skb_udp_tunnel_segment is called, the function to segment the
      inner packet is selected based on the inner IP protocol or Ethertype.
      In the case of an IP protocol encapsulation, the function is derived
      from inet[6]_offloads.  In the case of an Ethertype, skb->protocol is
      set to the inner_protocol and skb_mac_gso_segment is called.  (GRE
      currently does this, but it might be possible to look up the protocol
      in offload_base and call the appropriate segmentation function
      directly.)
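
      The new helpers, roughly (a sketch following the mainline commit):

          /* Record the inner header's protocol and tag which union member
           * is valid; called where the tunnel encapsulation occurs. */
          static inline void skb_set_inner_protocol(struct sk_buff *skb,
                                                    __be16 protocol)
          {
                  skb->inner_protocol = protocol;
                  skb->inner_protocol_type = ENCAP_TYPE_ETHER;
          }

          static inline void skb_set_inner_ipproto(struct sk_buff *skb,
                                                   __u8 ipproto)
          {
                  skb->inner_ipproto = ipproto;
                  skb->inner_protocol_type = ENCAP_TYPE_IPPROTO;
          }
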
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8bce6d7d
    • net: cleanup and document skb fclone layout · d0bf4a9e
      Authored by Eric Dumazet
      Let's use a proper structure to clearly document and implement
      skb fast clones.
      
      Then we might more easily experiment with alternative layouts.
      
      This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
      to stop leaking implementation details.
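
      The layout and helper, following the mainline commit:

          /* Original skb, its companion fast clone, and the shared
           * reference count now live in one documented slab object. */
          struct sk_buff_fclones {
                  struct sk_buff  skb1;
                  struct sk_buff  skb2;
                  atomic_t        fclone_ref;
          };

          /* True while the companion clone is still in flight, i.e. the
           * fast clone cannot be reused yet. */
          static inline bool skb_fclone_busy(const struct sk_buff *skb)
          {
                  const struct sk_buff_fclones *fclones;

                  fclones = container_of(skb, struct sk_buff_fclones, skb1);

                  return skb->fclone == SKB_FCLONE_ORIG &&
                         fclones->skb2.fclone == SKB_FCLONE_CLONE;
          }
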
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0bf4a9e