1. 20 December 2018, 4 commits
    • net: use skb_sec_path helper in more places · 2294be0f
      Authored by Florian Westphal
      skb_sec_path gains a 'const' qualifier to avoid the warning:
      xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
      
      Same reasoning as the previous conversions: these spots won't need to be
      touched anymore when skb->sp is removed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move secpath_exist helper to sk_buff.h · 7af8f4ca
      Authored by Florian Westphal
      A future patch will remove the skb->sp pointer.
      To reduce noise in those patches, move the existing helper to
      sk_buff.h and use it in more places to ease the skb->sp replacement later.
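
      A minimal sketch, assuming the helpers keep the shape implied by these two
      commits (names taken from the commit messages; the exact upstream code may
      differ), of how the const-qualified accessor and the existence check could
      look once both live in sk_buff.h:

          /* sketch of helpers as they might appear in include/linux/skbuff.h */
          static inline struct sec_path *skb_sec_path(const struct sk_buff *skb)
          {
          #ifdef CONFIG_XFRM
              return skb->sp;   /* still the plain pointer at this point in the series */
          #else
              return NULL;
          #endif
          }

          static inline bool secpath_exists(const struct sk_buff *skb)
          {
              return skb_sec_path(skb) != NULL;
          }
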
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Authored by Florian Westphal
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf-specific hooks in the skb clone and free paths are removed;
      they have been replaced by the skb_ext hooks, which do the same things the
      bridge nf allocation hooks did.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sk_buff: add skb extension infrastructure · df5042f4
      Authored by Florian Westphal
      This adds an optional extension infrastructure, with ipsec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if kernel is built without xfrm and br_netfilter
      support.
      
      The third (planned future) user is Multipath TCP, which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      an MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skbs, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has the same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff (a rough sketch follows this list):
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows an extension to be "deleted" by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined;
         it is not initialized on skb allocation.
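
      A rough sketch of the layout and lookup path described by the two points
      above (field and helper names follow the commit message; the real
      definitions differ in detail and are guarded by their own config option):

          /* hypothetical, simplified sketch -- not the exact upstream definitions */
          enum skb_ext_id {
              SKB_EXT_BRIDGE_NF,
              SKB_EXT_SEC_PATH,
              SKB_EXT_NUM,
          };

          struct skb_ext {
              refcount_t refcnt;
              u8         offset[SKB_EXT_NUM]; /* per-id offset into data[], in 8-byte chunks */
              u8         chunks;              /* total allocated size, in 8-byte chunks */
              char       data[];              /* extension payloads live here */
          };

          static inline bool skb_ext_exist(const struct sk_buff *skb, enum skb_ext_id id)
          {
              return skb->active_extensions & (1 << id);
          }

          static inline void *skb_ext_find(const struct sk_buff *skb, enum skb_ext_id id)
          {
              if (!skb_ext_exist(skb, id))
                  return NULL;  /* ->extensions may be uninitialized, never touch it */

              return skb->extensions->data + skb->extensions->offset[id] * 8;
          }

          static inline void skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
          {
              /* "deleting" needs no COW and cannot fail: just clear the bit
               * (the real helper also drops the area once no bits remain) */
              skb->active_extensions &= ~(1 << id);
          }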
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preserved on
      clones/copies.
      
      To do this, one would need to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (it's unlikely that all extensions are active/needed at the same
      time for the same skb).
      The allocated memory (except the small extension header) is not cleared,
      so there is no additional overhead aside from memory usage.
      
      Avoid the atomic_dec_and_test operation on skb_ext_put()
      by using a similar trick to what kfree_skbmem() does with fclone_ref:
      if the refcount is 1, there is no concurrent user and we can free right away.
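
      A hedged sketch of that fast path (the cache name and the exact split
      between fast and slow path are assumptions, not the verbatim upstream code):

          static inline void skb_ext_put(struct sk_buff *skb)
          {
              struct skb_ext *ext = skb->extensions;

              if (!skb->active_extensions)
                  return;   /* no extensions: the pointer was never initialized */

              /* refcount == 1: we are the only user, skip the atomic RMW */
              if (refcount_read(&ext->refcnt) == 1 ||
                  refcount_dec_and_test(&ext->refcnt))
                  kmem_cache_free(skbuff_ext_cache, ext);  /* assumed cache name */
          }
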
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 December 2018, 1 commit
    • skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark' · 875e8939
      Authored by Ido Schimmel
      Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added
      the 'offload_mr_fwd_mark' field to indicate that a packet has already
      undergone L3 multicast routing by a capable device. The field is used to
      prevent the kernel from forwarding a packet through a netdev through
      which the device has already forwarded the packet.
      
      Currently, no unicast packet is routed by both the device and the
      kernel, but this is about to change in subsequent patches and we need to
      be able to mark such packets, so that they will not be forwarded twice.
      
      Instead of adding yet another field to 'struct sk_buff', we can just
      rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
      either has a multicast or a unicast destination IP.
      
      While at it, add a comment about both 'offload_fwd_mark' and
      'offload_l3_fwd_mark'.
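
      A simplified fragment (guards and comments paraphrased, not copied from
      the upstream struct) of how the renamed bit and its comment could sit
      next to 'offload_fwd_mark':

          struct sk_buff {
              /* ... */
          #ifdef CONFIG_NET_SWITCHDEV
              __u8 offload_fwd_mark:1;    /* already L2-forwarded by the HW */
              __u8 offload_l3_fwd_mark:1; /* already L3-routed (unicast or multicast)
                                           * by the HW; do not forward it again */
          #endif
              /* ... */
          };
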
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 December 2018, 2 commits
    • udp: elide zerocopy operation in hot path · 52900d22
      Authored by Willem de Bruijn
      With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
      Release of its last reference triggers a completion notification.
      
      The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
      the skbs, because it can build, send and free skbs within its loop,
      possibly reaching refcount zero and freeing the ubuf_info too soon.
      
      The UDP stack currently also takes this extra ref, but does not need
      it as all skbs are sent after return from __ip(6)_append_data.
      
      Avoid the extra refcount_inc and refcount_dec_and_test, and generally
      the sock_zerocopy_put in the common path, by passing the initial
      reference to the first skb.
      
      This approach is taken instead of initializing the refcount to 0, as
      that would generate error "refcount_t: increment on 0" on the
      next skb_zcopy_set.
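
      Sketched below with a hypothetical helper (the actual patch folds this
      logic into skb_zcopy_set() via an extra 'have_ref' argument); the point is
      that the caller's initial reference is donated to the first skb:

          static void udp_zcopy_attach(struct sk_buff *skb, struct ubuf_info *uarg,
                                       bool *have_ref)
          {
              skb_shinfo(skb)->destructor_arg = uarg;
              skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;

              if (*have_ref)
                  *have_ref = false;        /* consume the initial reference */
              else
                  sock_zerocopy_get(uarg);  /* later skbs take their own ref */
          }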
      
      Changes
        v3 -> v4
          - Move skb_zcopy_set below the only kfree_skb that might cause
            a premature uarg destroy before skb_zerocopy_put_abort
          - Move the entire skb_shinfo assignment block, to keep that
            cacheline access in one place
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: msg_zerocopy · b5947e5d
      Authored by Willem de Bruijn
      Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
      interpret flag MSG_ZEROCOPY.
      
      This patch was previously part of the zerocopy RFC patchsets. Zerocopy
      is not effective at small MTU. With segmentation offload building
      larger datagrams, the benefit of page flipping outweighs the cost of
      generating a completion notification.
      
      tools/testing/selftests/net/msg_zerocopy.sh results, after applying the
      follow-on test patch and making skb_orphan_frags_rx the same as
      skb_orphan_frags:
      
          ipv4 udp -t 1
          tx=191312 (11938 MB) txc=0 zc=n
          rx=191312 (11938 MB)
          ipv4 udp -z -t 1
          tx=304507 (19002 MB) txc=304507 zc=y
          rx=304507 (19002 MB)
          ok
          ipv6 udp -t 1
          tx=174485 (10888 MB) txc=0 zc=n
          rx=174485 (10888 MB)
          ipv6 udp -z -t 1
          tx=294801 (18396 MB) txc=294801 zc=y
          rx=294801 (18396 MB)
          ok
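
      For context, a minimal userspace sketch of the new capability (flag values
      shown are the standard uapi ones; completion handling via the socket error
      queue is elided):

          #include <sys/socket.h>
          #include <netinet/in.h>

          #ifndef SO_ZEROCOPY
          #define SO_ZEROCOPY  60
          #endif
          #ifndef MSG_ZEROCOPY
          #define MSG_ZEROCOPY 0x4000000
          #endif

          static ssize_t udp_send_zerocopy(int fd, const struct sockaddr_in6 *dst,
                                           const void *buf, size_t len)
          {
              int one = 1;

              /* opt in first: MSG_ZEROCOPY is ignored without SO_ZEROCOPY */
              if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                  return -1;

              /* buf must stay untouched until the completion notification
               * arrives on the error queue (recvmsg() with MSG_ERRQUEUE) */
              return sendto(fd, buf, len, MSG_ZEROCOPY,
                            (const struct sockaddr *)dst, sizeof(*dst));
          }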
      
      Changes
        v1 -> v2
          - Fixup reverse christmas tree violation
        v2 -> v3
          - Split refcount avoidance optimization into separate patch
          - Fix refcount leak on error in fragmented case
            (thanks to Paolo Abeni for pointing this one out!)
          - Fix refcount inc on zero
          - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
            This is needed since commit 5cf4a853 ("tcp: really ignore
            MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 26 November 2018, 1 commit
    • net: remove unsafe skb_insert() · 4bffc669
      Authored by Eric Dumazet
      I do not see how one can effectively use skb_insert() without holding
      some kind of lock. Otherwise other cpus could have changed the list
      right before we have a chance of acquiring list->lock.
      
      The only existing user is in drivers/infiniband/hw/nes/nes_mgt.c, and that
      one probably meant to use __skb_insert(), since it appears nesqp->pau_list
      is protected by nesqp->pau_lock. It looks like nesqp->pau_lock
      could be removed, since nesqp->pau_list.lock could be used instead.
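
      A hypothetical illustration of the locked pattern the changelog suggests
      for that driver (struct and field names taken from the message above; this
      is not the actual nes_mgt.c code):

          static void pau_insert_locked(struct nes_qp *nesqp, struct sk_buff *prev,
                                        struct sk_buff *newskb)
          {
              unsigned long flags;

              spin_lock_irqsave(&nesqp->pau_lock, flags);
              /* __skb_insert() does no locking of its own; the caller's
               * lock keeps other CPUs from changing the list under us */
              __skb_insert(newskb, prev, prev->next, &nesqp->pau_list);
              spin_unlock_irqrestore(&nesqp->pau_lock, flags);
          }
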
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Faisal Latif <faisal.latif@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: linux-rdma <linux-rdma@vger.kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 24 November 2018, 1 commit
    • packet: copy user buffers before orphan or clone · 5cd8d46e
      Authored by Willem de Bruijn
      tpacket_snd sends packets with user pages linked into skb frags. It
      notifies that pages can be reused when the skb is released by setting
      skb->destructor to tpacket_destruct_skb.
      
      This can cause data corruption if the skb is orphaned (e.g., on
      transmit through veth) or cloned (e.g., on mirror to another psock).
      
      Create a kernel-private copy of data in these cases, same as tun/tap
      zerocopy transmission. Reuse that infrastructure: mark the skb as
      SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
      
      Unlike other zerocopy packets, do not set shinfo destructor_arg to
      struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
      when the original skb is released and a timestamp is recorded. Do not
      change this timestamp behavior. The ubuf_info->callback is not needed
      anyway, as no zerocopy notification is expected.
      
      Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
      resulting value is not a valid ubuf_info pointer, nor a valid
      tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
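
      A sketch of what such helpers can look like (close to, but not guaranteed
      to be, the exact upstream versions); the low bit doubles as the not-a-uarg
      marker:

          static inline void skb_zcopy_set_nouarg(struct sk_buff *skb, void *val)
          {
              skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)val | 0x1UL);
              skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
          }

          static inline bool skb_zcopy_is_nouarg(struct sk_buff *skb)
          {
              return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
          }

          static inline void *skb_zcopy_get_nouarg(struct sk_buff *skb)
          {
              /* mask the marker bit off before using the stored pointer */
              return (void *)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
          }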
      
      The fix relies on features introduced in commit 52267790 ("sock:
      add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
      
      Tested with `./in_netns.sh ./txring_overwrite` from
      http://github.com/wdebruij/kerneltools/tests
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 17 November 2018, 3 commits
  7. 07 November 2018, 1 commit
  8. 18 October 2018, 1 commit
  9. 03 October 2018, 1 commit
  10. 22 September 2018, 1 commit
  11. 20 September 2018, 1 commit
    • flow_dissector: fix build failure without CONFIG_NET · 2dfd184a
      Authored by Willem de Bruijn
      If boolean CONFIG_BPF_SYSCALL is enabled, kernel/bpf/syscall.c will
      call flow_dissector functions from net/core/flow_dissector.c.
      
      This causes this build failure if CONFIG_NET is disabled:
      
          kernel/bpf/syscall.o: In function `__x64_sys_bpf':
          syscall.c:(.text+0x3278): undefined reference to
          `skb_flow_dissector_bpf_prog_attach'
          syscall.c:(.text+0x3310): undefined reference to
          `skb_flow_dissector_bpf_prog_detach'
          kernel/bpf/syscall.o:(.rodata+0x3f0): undefined reference to
          `flow_dissector_prog_ops'
          kernel/bpf/verifier.o:(.rodata+0x250): undefined reference to
          `flow_dissector_verifier_ops'
      
      Analogous to other optional BPF program types in syscall.c, add stubs
      if the relevant functions are not compiled in, and move the BPF_PROG_TYPE
      definition into the #ifdef CONFIG_NET block.
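
      A sketch of the usual stub pattern (prototypes abbreviated; see the actual
      header for the real declarations):

          #ifdef CONFIG_NET
          int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
                                                 struct bpf_prog *prog);
          int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
          #else
          static inline int
          skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
                                             struct bpf_prog *prog)
          {
              return -EOPNOTSUPP;  /* flow dissector not built without CONFIG_NET */
          }

          static inline int
          skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
          {
              return -EOPNOTSUPP;
          }
          #endif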
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  12. 15 September 2018, 1 commit
  13. 11 September 2018, 3 commits
  14. 10 August 2018, 2 commits
  15. 06 August 2018, 2 commits
  16. 19 July 2018, 1 commit
    • net: Move skb decrypted field, avoid explicity copy · a48d189e
      Authored by Stefano Brivio
      Commit 784abe24 ("net: Add decrypted field to skb")
      introduced a 'decrypted' field that is explicitly copied on skb
      copy and clone.
      
      Move it between headers_start[0] and headers_end[0], so that we
      don't need to copy it explicitly as it's copied by the memcpy()
      in __copy_skb_header().
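
      A simplified excerpt-style sketch of the mechanism referred to above (only
      the relevant part of __copy_skb_header() is shown): every field placed
      between headers_start[0] and headers_end[0] is covered by one memcpy(), so
      fields moved into that window need no explicit copy.

          static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
          {
              /* ... fields outside the window are copied one by one ... */

              memcpy(&new->headers_start, &old->headers_start,
                     offsetof(struct sk_buff, headers_end) -
                     offsetof(struct sk_buff, headers_start));
          }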
      
      While at it, drop the assignment in __skb_clone(); it was
      already redundant.
      
      This doesn't change the size of sk_buff or cacheline boundaries.
      
      The 15-bit hole before tc_index becomes a 14-bit hole, and
      will again be a 15-bit hole when this change is merged with
      commit 8b700862 ("net: Don't copy pfmemalloc flag in
      __copy_skb_header()").
      
      v2: as reported by the kbuild test robot (oops, I forgot to build
          with CONFIG_TLS_DEVICE it seems), we can't use
          CHECK_SKB_FIELD() on a bit-field member. Just drop the
          check for the time being, perhaps we could think of some
          magic to also check bit-field members one day.
      
      Fixes: 784abe24 ("net: Add decrypted field to skb")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 16 July 2018, 1 commit
  18. 13 July 2018, 1 commit
    • net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Authored by Stefano Brivio
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contravening the warning comment of __skb_clone().
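
      Roughly (a sketch, not the full __skb_clone() diff), the clone side now
      looks like this, since pfmemalloc sits outside the copied header window:

          static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
          {
              __copy_skb_header(n, skb);       /* no longer copies ->pfmemalloc */

              /* a clone shares skb->head, so it really does reference
               * PFMEMALLOC reserves whenever the original does */
              n->pfmemalloc = skb->pfmemalloc;

              /* ... rest of the clone setup ... */
              return n;
          }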
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15-bit hole before tc_index into a 2-byte
      hole before csum, which could now be filled more easily.
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 29 June 2018, 1 commit
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Authored by Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 26 June 2018, 1 commit
    • net: Convert GRO SKB handling to list_head. · d4546c25
      Authored by David Miller
      Manage pending per-NAPI GRO packets via list_head.
      
      Return an SKB pointer from the GRO receive handlers.  When GRO receive
      handlers return non-NULL, it means that this SKB needs to be completed
      at this time and removed from the NAPI queue.
      
      Several operations are greatly simplified by this transformation,
      especially timing out the oldest SKB in the list when gro_count
      exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
      in reverse order.
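
      An illustrative fragment of the aging logic this enables (member and
      helper names are assumptions based on the message above, not the exact
      upstream diff):

          static void napi_flush_oldest_gro(struct napi_struct *napi)
          {
              struct sk_buff *skb, *tmp;

              /* walk from the tail, i.e. from the oldest pending packets */
              list_for_each_entry_safe_reverse(skb, tmp, &napi->gro_list, list) {
                  if (napi->gro_count <= MAX_GRO_SKBS)
                      break;
                  list_del(&skb->list);
                  napi_gro_complete(skb);  /* hand the oldest skb to the stack */
                  napi->gro_count--;
              }
          }
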
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 05 June 2018, 1 commit
  22. 26 May 2018, 1 commit
  23. 08 May 2018, 1 commit
    • net: core: rework basic flow dissection helper · 72a338bc
      Authored by Paolo Abeni
      When the core networking needs to detect the transport offset in a given
      packet and parse it explicitly, a full-blown flow_keys struct is used for
      storage.
      This patch introduces a smaller key store, reworks the basic flow
      dissection helper to use it, and applies the new helper where possible -
      namely in skb_probe_transport_header(). The flow dissector data structures
      involved are renamed to match the new role more closely.
      
      The above gives a ~50% performance improvement in micro benchmarking around
      skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
      to the smaller memset. A small but measurable improvement shows up in macro
      benchmarking, too.
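
      A hedged sketch of the smaller key store and helper (struct and function
      names follow the flow dissector naming; the real definitions live in
      include/net/flow_dissector.h and may differ in detail):

          struct flow_keys_basic {
              struct flow_dissector_key_control control;
              struct flow_dissector_key_basic   basic;   /* n_proto / ip_proto */
          };

          static inline bool
          skb_flow_dissect_flow_keys_basic(const struct sk_buff *skb,
                                           struct flow_keys_basic *flow,
                                           void *data, __be16 proto,
                                           int nhoff, int hlen, unsigned int flags)
          {
              /* zeroing this small struct is much cheaper than zeroing
               * a full struct flow_keys */
              memset(flow, 0, sizeof(*flow));
              return __skb_flow_dissect(skb, &flow_keys_basic_dissector, flow,
                                        data, proto, nhoff, hlen, flags);
          }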
      
      v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
        as per DaveM's suggestion
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 01 May 2018, 1 commit
  25. 27 April 2018, 1 commit
    • udp: add udp gso · ee80d1eb
      Authored by Willem de Bruijn
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
      from UFO (SKB_UDP_GSO).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
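
      For context, a userspace sketch of how the feature is exercised once the
      follow-up protocol-stack patch lands (UDP_SEGMENT is the socket option
      added by that follow-up; shown here as an assumption about the series, not
      about this patch itself):

          #include <sys/socket.h>
          #include <netinet/in.h>

          #ifndef UDP_SEGMENT
          #define UDP_SEGMENT 103
          #endif

          /* ask the stack to split one large send into 1400-byte datagrams */
          static int enable_udp_gso(int fd)
          {
              int gso_size = 1400;

              return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
                                &gso_size, sizeof(gso_size));
          }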
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 20 April 2018, 1 commit
    • net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends · 88078d98
      Authored by Eric Dumazet
      After working on IP defragmentation lately, I found that some large
      packets defeat the CHECKSUM_COMPLETE optimization because of the NIC adding
      zero padding on the last (small) fragment.
      
      While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
      to CHECKSUM_NONE, forcing a full csum validation, even if all prior
      fragments had CHECKSUM_COMPLETE set.
      
      We can instead compute the checksum of the part we are trimming,
      usually smaller than the part we keep.
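
      A sketch of the idea (simplified from what the trimming helper ends up
      doing; the wrapper name below is hypothetical, for illustration only):

          static int trim_and_keep_csum(struct sk_buff *skb, unsigned int len)
          {
              if (skb->ip_summed == CHECKSUM_COMPLETE) {
                  int delta = skb->len - len;

                  /* checksum only the (usually small) tail being removed,
                   * and subtract it from the stored complete checksum */
                  skb->csum = csum_sub(skb->csum,
                                       skb_checksum(skb, len, delta, 0));
              }
              return pskb_trim(skb, len);
          }
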
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 01 April 2018, 1 commit
    • inet: frags: get rid of ipfrag_skb_cb/FRAG_CB · bf663371
      Authored by Eric Dumazet
      ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
      this integer is currently in a different cache line than skb->next,
      meaning that we use two cache lines per skb when finding the insertion point.
      
      By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
      in a single cache line and save precious memory bandwidth.
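
      Illustrative fragment of the aliasing (simplified; the real union in
      struct sk_buff also carries other members such as dev_scratch):

          struct frag_queue_skb_view {      /* hypothetical view, for illustration */
              struct sk_buff    *next;
              struct sk_buff    *prev;
              union {
                  struct net_device *dev;              /* normal usage */
                  int                ip_defrag_offset; /* while queued in ip_defrag:
                                                        * same cache line as ->next */
              };
          };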
      
      Note that after the fast path added by Changli Gao in commit
      d6bebca9 ("fragment: add fast path for in-order fragments")
      this change won't help the fast path, since we still need
      to access prev->len (2nd cache line), but it will show great
      benefits when the slow path is entered, since we perform
      a linear scan of a potentially long list.
      
      Also, note that this potentially long list is an attack vector;
      we might consider also using an rb-tree there eventually.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 05 March 2018, 2 commits
  29. 04 March 2018, 1 commit