1. 05 Aug, 2017 16 commits
  2. 04 Aug, 2017 24 commits
    • W
      tcp: enable MSG_ZEROCOPY · f214f915
      Willem de Bruijn committed
      Enable support for MSG_ZEROCOPY in the TCP stack. TSO and GSO are
      both supported. Only data sent to remote destinations is sent without
      copying. Packets looped onto a local destination have their payload
      copied to avoid unbounded latency.
      
      Tested:
        A 10x TCP_STREAM between two hosts showed a reduction in netserver
        process cycles by up to 70%, depending on packet size. Systemwide,
        savings are of course much less pronounced, at up to 20% best case.
      
        msg_zerocopy.sh 4 tcp:
      
        without zerocopy
          tx=121792 (7600 MB) txc=0 zc=n
          rx=60458 (7600 MB)
      
        with zerocopy
          tx=286257 (17863 MB) txc=286257 zc=y
          rx=140022 (17863 MB)
      
        This test opens a pair of sockets over veth. One calls send with
        64KB and optionally MSG_ZEROCOPY, and the other reads the initial
        bytes. The receiver truncates, so this is strictly an upper bound on
        what is achievable. It is more representative of sending data out of
        a physical NIC (when payload is not touched, either).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f214f915
    • W
      sock: ulimit on MSG_ZEROCOPY pages · a91dbff5
      Willem de Bruijn committed
      Bound the number of pages that a user may pin.
      
      Follow the lead of perf tools to maintain a per-user bound on memory
      locked pages; see commit 789f90fc ("perf_counter: per user mlock gift").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a91dbff5
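A minimal userspace sketch of per-user accounting for pinned pages, written in C11 rather than kernel code. The struct `user_acct`, the function names, and the limit value are hypothetical; this log does not show the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user accounting record (not a kernel structure). */
struct user_acct {
    atomic_ulong pinned_pages;  /* pages currently pinned by this user */
    unsigned long max_pages;    /* per-user bound, e.g. derived from RLIMIT_MEMLOCK */
};

/* Pages touched by len bytes that start at offset off within the first page. */
static unsigned long pages_spanned(size_t off, size_t len, size_t page_size)
{
    if (len == 0)
        return 0;
    return (off % page_size + len + page_size - 1) / page_size;
}

/* Charge n pages to the user; fail without side effects if over the bound. */
static bool account_pinned_pages(struct user_acct *u, unsigned long n)
{
    unsigned long old = atomic_load(&u->pinned_pages);
    unsigned long total;

    do {
        total = old + n;
        if (total < old || total > u->max_pages)  /* overflow or over limit */
            return false;
    } while (!atomic_compare_exchange_weak(&u->pinned_pages, &old, total));
    return true;
}

/* Return n previously charged pages once the buffers are released. */
static void unaccount_pinned_pages(struct user_acct *u, unsigned long n)
{
    atomic_fetch_sub(&u->pinned_pages, n);
}
```

The compare-and-swap loop keeps the check and the charge atomic, so two concurrent senders cannot both slip under the limit.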
    • W
      sock: MSG_ZEROCOPY notification coalescing · 4ab6c99d
      Willem de Bruijn committed
      In the simple case, each sendmsg() call generates data and eventually
      a zerocopy ready notification N, where N indicates the Nth successful
      invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.
      
      TCP and corked sockets can cause send() calls to append new data to an
      existing sk_buff and, thus, ubuf_info. In that case the notification
      must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
      and add skb_zerocopy_realloc() to optionally extend an existing range.
      
      Also coalesce notifications in this common case: if a notification
      [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
      that tail entry to read [0, 1].
      
      Coalescing is limited to a few TSO frames worth of data to bound
      notification latency.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ab6c99d
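The coalescing rule described in this commit can be sketched as a small userspace model. The struct `zc_range`, the function `zc_try_coalesce`, and the byte cap are hypothetical; the commit only says coalescing is limited to "a few TSO frames worth of data".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a queued completion covering sendmsg() calls [lo..hi]. */
struct zc_range {
    uint32_t lo;
    uint32_t hi;     /* inclusive upper bound */
    uint32_t bytes;  /* payload covered, used to bound coalescing */
};

/* Illustrative cap only; the real bound is not stated in this log. */
#define ZC_COALESCE_MAX_BYTES (4u * 65536u)

/* Try to merge next into the queue tail. Returns true if tail now covers both. */
static bool zc_try_coalesce(struct zc_range *tail, const struct zc_range *next)
{
    /* Only directly adjacent ranges can be merged. */
    if (next->lo != (uint32_t)(tail->hi + 1))
        return false;
    /* Bound the merged size so that notification latency stays bounded. */
    if (tail->bytes > ZC_COALESCE_MAX_BYTES ||
        next->bytes > ZC_COALESCE_MAX_BYTES - tail->bytes)
        return false;

    tail->hi = next->hi;
    tail->bytes += next->bytes;
    return true;
}
```

When the function returns false, the caller would queue `next` as a separate notification.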
    • W
      sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn committed
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, etc. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy buffers, ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f8b977a
    • W
      sock: add SOCK_ZEROCOPY sockopt · 76851d12
      Willem de Bruijn committed
      The send call ignores unknown flags. Legacy applications may already
      unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
      socket opts in to zerocopy.
      
      Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
      Processes can also query this socket option to detect kernel support
      for the feature. Older kernels will return ENOPROTOOPT.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76851d12
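A minimal userspace sketch of the opt-in and feature detection described above. The helper name `enable_zerocopy` is hypothetical, and the fallback value 60 for `SO_ZEROCOPY` is an assumption that holds on most Linux architectures when the installed headers predate the option.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60  /* assumption: asm-generic value, for older headers */
#endif

/*
 * Opt a socket in to MSG_ZEROCOPY processing.
 * Returns 1 if enabled, 0 if the kernel does not know the option
 * (ENOPROTOOPT), and -1 on any other error (errno is left set).
 */
static int enable_zerocopy(int fd)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0)
        return 1;
    return errno == ENOPROTOOPT ? 0 : -1;
}
```

A caller would typically fall back to plain `send()` without the flag when the helper returns 0.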
    • W
      sock: add MSG_ZEROCOPY · 52267790
      Willem de Bruijn committed
      The kernel supports zerocopy sendmsg in virtio and tap. Expand the
      infrastructure to support other socket types. Introduce a completion
      notification channel over the socket error queue. Notifications are
      returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
      blocking the send/recv path on receiving notifications.
      
      Add reference counting, to support the skb split, merge, resize and
      clone operations possible with SOCK_STREAM and other socket types.
      
      The patch does not yet modify any datapaths.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52267790
    • W
      sock: skb_copy_ubufs support for compound pages · 3ece7826
      Willem de Bruijn committed
      Refine skb_copy_ubufs to support compound pages. With upcoming TCP
      zerocopy sendmsg, such fragments may appear.
      
      The existing code replaces each page one for one. Splitting each
      compound page into an independent number of regular pages can exceed
      the MAX_SKB_FRAGS limit if data is not exactly page aligned.
      
      Instead, fill all destination pages but the last to PAGE_SIZE.
      Split the existing alloc + copy loop into separate stages:
      1. compute the byte length and the minimum number of pages needed to store it.
      2. allocate
      3. copy, filling each page except the last to PAGE_SIZE bytes
      4. update skb frag array
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ece7826
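The staged copy listed above can be modeled in userspace as follows. This is a sketch under assumptions, not the kernel function: `struct frag`, `SK_PAGE_SIZE`, and all function names are hypothetical, and heap buffers stand in for pages. The final stage, updating the skb frag array, has no userspace counterpart and is omitted.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u  /* illustrative page size */

/* Hypothetical source fragment: a pointer and a length. */
struct frag {
    const unsigned char *data;
    size_t len;
};

/* Stage 1a: total byte length across all fragments. */
static size_t frags_bytelen(const struct frag *f, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += f[i].len;
    return total;
}

/* Stage 1b: minimum number of pages that can hold bytelen bytes. */
static size_t pages_needed(size_t bytelen)
{
    return (bytelen + SK_PAGE_SIZE - 1) / SK_PAGE_SIZE;
}

/*
 * Stages 2 and 3: allocate the pages, then copy, filling every page
 * except possibly the last to SK_PAGE_SIZE bytes.
 * Returns the number of pages used, or 0 on failure or empty input.
 */
static size_t copy_frags(const struct frag *f, size_t n,
                         unsigned char **pages, size_t max_pages)
{
    size_t npages = pages_needed(frags_bytelen(f, n));
    size_t page = 0, off = 0, i;

    if (npages == 0 || npages > max_pages)
        return 0;
    for (i = 0; i < npages; i++) {
        pages[i] = malloc(SK_PAGE_SIZE);
        if (!pages[i]) {
            while (i--)
                free(pages[i]);
            return 0;
        }
    }
    for (i = 0; i < n; i++) {
        size_t done = 0;
        while (done < f[i].len) {
            size_t chunk = f[i].len - done;
            if (chunk > SK_PAGE_SIZE - off)
                chunk = SK_PAGE_SIZE - off;
            memcpy(pages[page] + off, f[i].data + done, chunk);
            done += chunk;
            off += chunk;
            if (off == SK_PAGE_SIZE) {  /* page full, move to the next */
                page++;
                off = 0;
            }
        }
    }
    return npages;
}
```

Packing the destination densely is what keeps the page count at the minimum, regardless of how the source fragments were aligned.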
    • W
      sock: allocate skbs from optmem · 98ba0bd5
      Willem de Bruijn committed
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98ba0bd5
    • I
      ipv6: fib: Add helpers to hold / drop a reference on rt6_info · a460aa83
      Ido Schimmel committed
      Similar to commit 1c677b3d ("ipv4: fib: Add fib_info_hold() helper")
      and commit b423cb10 ("ipv4: fib: Export free_fib_info()"), add a
      helper to hold a reference on rt6_info and export rt6_release() to drop
      it and potentially release the route.
      
      This is needed so that drivers capable of FIB offload could hold a
      reference on the route before queueing it for offload and drop it after
      the route has been programmed to the device's tables.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a460aa83
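The hold and release pattern that an offload driver would follow can be illustrated with a generic reference-counting sketch. The struct `route` and every function here are hypothetical stand-ins written in userspace C11; they are not the rt6_info helpers themselves.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a reference-counted route object. */
struct route {
    atomic_int ref;
    void (*destroy)(struct route *rt);  /* called when the last reference drops */
};

/* Take a reference, e.g. before queueing the route for offload. */
static void route_hold(struct route *rt)
{
    atomic_fetch_add(&rt->ref, 1);
}

/* Drop a reference, e.g. after the device tables were programmed. */
static void route_release(struct route *rt)
{
    /* atomic_fetch_sub returns the previous value: 1 means we were last. */
    if (atomic_fetch_sub(&rt->ref, 1) == 1)
        rt->destroy(rt);
}

/* Demo destructor that only counts how many routes were destroyed. */
static int routes_destroyed;

static void route_destroy_count(struct route *rt)
{
    (void)rt;
    routes_destroyed++;
}
```

Holding the reference across the asynchronous offload work guarantees the object outlives its removal from the FIB.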
    • I
      ipv6: Regenerate host route according to node pointer upon interface up · fc882fcf
      Ido Schimmel committed
      When an interface is brought back up, the kernel tries to restore the
      host routes tied to its permanent addresses.
      
      However, if the host route was removed from the FIB, then we need to
      reinsert it. This is done by releasing the current dst and allocating a
      new one, so as not to reuse a dst with obsolete values.
      
      Since this function is called under RTNL and using the same explanation
      from the previous patch, we can test if the route is in the FIB by
      checking its node pointer instead of its reference count.
      
      Tested using the following script and Andrey's reproducer mentioned
      in commit 8048ced9 ("net: ipv6: regenerate host route if moved to gc
      list") and linked below:
      
      $ ip link set dev lo up
      $ ip link add dummy1 type dummy
      $ ip -6 address add cafe::1/64 dev dummy1
      $ ip link set dev lo down	# cafe::1/128 is removed
      $ ip link set dev dummy1 up
      $ ip link set dev lo up
      
      The host route is correctly regenerated.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Link: http://lkml.kernel.org/r/CAAeHK+zSe82vc5gCRgr_EoUwiALPnWVdWJBPwJZBpbxYz=kGJw@mail.gmail.com
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc882fcf
    • I
      ipv6: Regenerate host route according to node pointer upon loopback up · 9217d8c2
      Ido Schimmel committed
      When the loopback device is brought back up we need to check if the host
      route attached to the address is still in the FIB and regenerate one in
      case it's not.
      
      Host routes using the loopback device are always inserted into and
      removed from the FIB under RTNL (under which this function is called),
      so we can test their node pointer instead of the reference count in
      order to check if the route is in the FIB or not.
      
      Tested using the following script from Nicolas mentioned in
      commit a220445f ("ipv6: correctly add local routes when lo goes up"):
      
      $ ip link add dummy1 type dummy
      $ ip link set dummy1 up
      $ ip link set lo down ; ip link set lo up
      
      The host route is correctly regenerated.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9217d8c2
    • I
      ipv6: fib: Unlink replaced routes from their nodes · 7483cea7
      Ido Schimmel committed
      When a route is deleted its node pointer is set to NULL to indicate it's
      no longer linked to its node. Do the same for routes that are replaced.
      
      This will later allow us to test if a route is still in the FIB by
      checking its node pointer instead of its reference count.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7483cea7
    • I
      ipv6: fib: Don't assume only nodes hold a reference on routes · c5b12410
      Ido Schimmel committed
      The code currently assumes that only FIB nodes can hold a reference on
      routes. Therefore, after fib6_purge_rt() has run and the route is no
      longer present in any intermediate nodes, it's assumed that its
      reference count would be 1 - taken by the node where it's currently
      stored.
      
      However, we're going to allow users other than the FIB to take a
      reference on a route, so this assumption is no longer valid and the
      BUG_ON() needs to be removed.
      
      Note that purging only takes place if the initial reference count is
      different than 1. I've left that check intact, as in the majority of
      systems (where routes are only referenced by the FIB), it does actually
      mean the route is present in intermediate nodes.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5b12410
    • I
      ipv6: fib: Add offload indication to routes · 61e4d01e
      Ido Schimmel committed
      Allow user space applications to see which routes are offloaded and
      which aren't by setting the RTNH_F_OFFLOAD flag when dumping them.
      
      To be consistent with IPv4, offload indication is provided on a
      per-nexthop basis.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61e4d01e
    • I
      ipv6: fib: Dump tables during registration to FIB chain · e1ee0a5b
      Ido Schimmel committed
      Dump all the FIB tables in each net namespace upon registration to the
      FIB notification chain so that the callee will have a complete view of
      the tables.
      
      The integrity of the dump is ensured by a per-table sequence counter
      that is incremented (under write lock) whenever a route is added or
      deleted from the table.
      
      All the sequence counters are read (under each table's read lock) and
      summed, before and after the dump. If the two sums differ, the
      dump is either restarted or the registration fails.
      
      While it's possible for a table to be modified after its counter has
      been read, this isn't really a problem. In case it happened before it
      was read the second time, then the comparison at the end will fail. If
      it happened afterwards, then we're guaranteed to be notified about the
      change, as the notification block is registered prior to the second
      read.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1ee0a5b
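The sequence-counter check around the dump can be sketched as a small model. Everything here is hypothetical (`fib_table_model`, `register_with_dump`, the retry limit, and the demo dump callback); locking and the actual notifier registration are reduced to comments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table model: seq is bumped whenever a route is added or deleted. */
struct fib_table_model {
    unsigned int seq;
};

/* Sum of all per-table sequence counters. */
static unsigned int seq_sum(const struct fib_table_model *t, size_t n)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += t[i].seq;
    return sum;
}

typedef void (*dump_fn)(struct fib_table_model *t, size_t n, void *arg);

#define DUMP_MAX_RETRIES 5  /* illustrative; the real limit is not given here */

/* Returns true once a dump completed with no concurrent table change. */
static bool register_with_dump(struct fib_table_model *t, size_t n,
                               dump_fn dump, void *arg)
{
    for (int retry = 0; retry < DUMP_MAX_RETRIES; retry++) {
        unsigned int before = seq_sum(t, n);

        /* The notifier would be registered here, before the second read,
         * so any later change arrives as a regular notification. */
        dump(t, n, arg);

        if (seq_sum(t, n) == before)
            return true;
        /* A table changed during the dump: unregister and try again. */
    }
    return false;
}

/* Demo dump: simulates a concurrent change during the first *arg calls. */
static void demo_dump(struct fib_table_model *t, size_t n, void *arg)
{
    int *racy_left = arg;

    if (n > 0 && *racy_left > 0) {
        t[0].seq++;
        (*racy_left)--;
    }
}
```

Comparing sums rather than individual counters is enough because counters only ever increase, so any change makes the second sum differ from the first.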
    • I
      ipv6: fib_rules: Dump rules during registration to FIB chain · dcb18f76
      Ido Schimmel committed
      Allow users of the FIB notification chain to receive a complete view of
      the IPv6 FIB rules upon registration to the chain.
      
      The integrity of the dump is ensured by a per-family sequence counter
      that is incremented (under RTNL) whenever a rule is added or deleted.
      
      All the sequence counters are read (under RTNL) and summed, before and
      after the dump. If the two sums differ, the dump is either
      restarted or the registration fails.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcb18f76
    • I
      ipv6: fib: Add in-kernel notifications for route add / delete · df77fe4d
      Ido Schimmel committed
      As with IPv4, allow listeners of the FIB notification chain to receive
      notifications whenever a route is added, replaced or deleted. This is
      done by placing calls to the FIB notification chain in the two lowest
      level functions that end up performing these operations - namely,
      fib6_add_rt2node() and fib6_del_route().
      
      Unlike IPv4, APPEND notifications aren't sent as the kernel doesn't
      distinguish between "append" (NLM_F_CREATE|NLM_F_APPEND) and "prepend"
      (NLM_F_CREATE). If NLM_F_EXCL isn't set, duplicate routes are always
      added after the existing duplicate routes.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df77fe4d
    • I
      ipv6: fib: Add FIB notifiers callbacks · 16ab6d7d
      Ido Schimmel committed
      We're about to add IPv6 FIB offload support, so implement the necessary
      callbacks in IPv6 code, which will later allow us to add routes and
      rules notifications.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16ab6d7d
    • I
      ipv6: fib_rules: Check if rule is a default rule · e3ea9731
      Ido Schimmel committed
      As explained in commit 3c71006d ("ipv4: fib_rules: Check if rule is
      a default rule"), drivers supporting IPv6 FIB offload need to be able to
      sanitize the rules they don't support and potentially flush their
      tables.
      
      Add an IPv6 helper to check if a FIB rule is a default rule.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3ea9731
    • I
      net: fib_rules: Implement notification logic in core · 1b2a4440
      Ido Schimmel committed
      Unlike the routing tables, the FIB rules share a common core, so instead
      of replicating the same logic for each address family we can simply dump
      the rules and send notifications from the core itself.
      
      To protect the integrity of the dump, a rules-specific sequence counter
      is added for each address family and incremented whenever a rule is
      added or deleted (under RTNL).
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b2a4440
    • I
      net: core: Make the FIB notification chain generic · 04b1d4e5
      Ido Schimmel committed
      The FIB notification chain is currently solely used by IPv4 code.
      However, we're going to introduce IPv6 FIB offload support, which
      requires these notifications as well.
      
      As explained in commit c3852ef7 ("ipv4: fib: Replay events when
      registering FIB notifier"), upon registration to the chain, the callee
      receives a full dump of the FIB tables and rules by traversing all the
      net namespaces. The integrity of the dump is ensured by a per-namespace
      sequence counter that is incremented whenever a change to the tables or
      rules occurs.
      
      In order to allow more address families to use the chain, each family is
      expected to register its fib_notifier_ops in its pernet init. These
      operations allow the common code to read the family's sequence counter
      as well as dump its tables and rules in the given net namespace.
      
      Additionally, a 'family' parameter is added to sent notifications, so
      that listeners could distinguish between the different families.
      
      Implement the common code that allows listeners to register to the chain
      and for address families to register their fib_notifier_ops. Subsequent
      patches will implement these operations in IPv6.
      
      In the future, ipmr and ip6mr will be extended to provide these
      notifications as well.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04b1d4e5
    • X
      sctp: remove the typedef sctp_auth_chunk_t · bb96dec7
      Xin Long committed
      This patch is to remove the typedef sctp_auth_chunk_t, and
      replace with struct sctp_auth_chunk in the places where it's
      using this typedef.
      
      It is also to use sizeof(variable) instead of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb96dec7
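A small illustration of the two conventions this cleanup applies: naming the struct directly instead of going through a typedef, and taking `sizeof` of the variable rather than of the type. The struct `demo_authhdr` and its fields are hypothetical and do not reproduce the SCTP definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, declared as a plain struct with no typedef alias. */
struct demo_authhdr {
    uint16_t key_id;
    uint16_t alg_id;
};

/* Zero a header and report how many bytes were cleared. */
static size_t clear_hdr(struct demo_authhdr *hdr)
{
    /* sizeof(*hdr) follows the variable, so it stays correct
     * even if the pointed-to type is changed later. */
    memset(hdr, 0, sizeof(*hdr));
    return sizeof(*hdr);
}
```

With `sizeof(struct demo_authhdr)` the size and the pointer type are stated separately, so changing one without the other would compile silently.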
    • X
      sctp: remove the typedef sctp_authhdr_t · 96f7ef4d
      Xin Long committed
      This patch is to remove the typedef sctp_authhdr_t, and
      replace with struct sctp_authhdr in the places where it's
      using this typedef.
      
      It is also to use sizeof(variable) instead of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96f7ef4d
    • X
      sctp: remove the typedef sctp_addip_chunk_t · 68d75469
      Xin Long committed
      This patch is to remove the typedef sctp_addip_chunk_t, and
      replace with struct sctp_addip_chunk in the places where it's
      using this typedef.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68d75469