1. 10 Aug, 2017 9 commits
  2. 09 Aug, 2017 1 commit
    • W
      net: avoid skb_warn_bad_offload false positives on UFO · 8d63bee6
      Willem de Bruijn committed
      skb_warn_bad_offload triggers a warning when an skb enters the GSO
      stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
      checksum offload set.
      
      Commit b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      observed that SKB_GSO_DODGY producers can trigger the check and
      that passing those packets through the GSO handlers will fix it
      up. But, the software UFO handler will set ip_summed to
      CHECKSUM_NONE.
      
      When __skb_gso_segment is called from the receive path, this
      triggers the warning again.
      
      Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
      Tx these two are equivalent. On Rx, this better matches the
      skb state (checksum computed), as CHECKSUM_NONE here means no
      checksum computed.
      
      See also this thread for context:
      http://patchwork.ozlabs.org/patch/799015/
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 08 Aug, 2017 2 commits
  4. 04 Aug, 2017 10 commits
    • W
      sock: ulimit on MSG_ZEROCOPY pages · a91dbff5
      Willem de Bruijn committed
      Bound the number of pages that a user may pin.
      
      Follow the lead of perf tools to maintain a per-user bound on memory
      locked pages; see commit 789f90fc ("perf_counter: per user mlock gift").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
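The per-user bound described in this commit can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct user_account`, `account_pin` and `account_unpin` are hypothetical names, and the limit is passed in directly rather than derived from a resource limit as the kernel would do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user accounting record (not the kernel's structure). */
struct user_account {
    atomic_size_t pinned_pages; /* pages currently pinned by this user */
    size_t max_pages;           /* upper bound on pinned pages */
};

/* Try to charge num_pages to the user. Returns false if the bound
 * would be exceeded, in which case nothing is charged. */
static bool account_pin(struct user_account *u, size_t num_pages)
{
    size_t old = atomic_load(&u->pinned_pages);
    size_t new_total;

    do {
        new_total = old + num_pages;
        if (new_total < old || new_total > u->max_pages)
            return false; /* arithmetic overflow or over the limit */
        /* On failure, old is reloaded and the check is repeated. */
    } while (!atomic_compare_exchange_weak(&u->pinned_pages, &old, new_total));

    return true;
}

/* Release a previous charge. */
static void account_unpin(struct user_account *u, size_t num_pages)
{
    size_t prev = atomic_fetch_sub(&u->pinned_pages, num_pages);

    assert(prev >= num_pages); /* never release more than was charged */
    (void)prev;
}
```

The compare-and-swap loop keeps the check and the update atomic, so two concurrent senders for the same user cannot jointly exceed the bound.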
    • W
      sock: MSG_ZEROCOPY notification coalescing · 4ab6c99d
      Willem de Bruijn committed
      In the simple case, each sendmsg() call generates data and eventually
      a zerocopy ready notification N, where N indicates the Nth successful
      invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.
      
      TCP and corked sockets can cause send() calls to append new data to an
      existing sk_buff and, thus, ubuf_info. In that case the notification
      must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
      and add skb_zerocopy_realloc() to optionally extend an existing range.
      
      Also coalesce notifications in this common case: if a notification
      [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
      the head of the queue to read [0, 1].
      
      Coalescing is limited to a few TSO frames worth of data to bound
      notification latency.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
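The coalescing rule above, where [1, 1] is merged into a queued [0, 0] to give [0, 1], can be modelled in a few lines. This is a simplified sketch and not the kernel code: `struct zc_range`, `zc_try_coalesce` and the byte cap `ZC_MAX_BYTES` are hypothetical, with the cap standing in for the "few TSO frames worth of data" limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one queued notification: an inclusive range
 * [lo, hi] of sendmsg() call counters plus the bytes it covers. */
struct zc_range {
    uint32_t lo;
    uint32_t hi;
    uint32_t bytes;
};

/* Illustrative cap on coalesced data, chosen arbitrarily here. */
#define ZC_MAX_BYTES (128u * 1024u)

/* Try to extend the queued notification 'tail' with 'next'.
 * Returns true if merged; false means 'next' must be queued separately. */
static bool zc_try_coalesce(struct zc_range *tail, struct zc_range next)
{
    assert(tail);

    /* Only directly adjacent ranges can merge; uint32_t arithmetic
     * keeps this correct across counter wraparound. */
    if (next.lo != (uint32_t)(tail->hi + 1))
        return false;

    /* Bound the amount of data behind one notification (latency). */
    if (tail->bytes > ZC_MAX_BYTES || next.bytes > ZC_MAX_BYTES - tail->bytes)
        return false;

    tail->hi = next.hi;
    tail->bytes += next.bytes;
    return true;
}
```

A caller would attempt the merge against the queue tail first and fall back to queueing a new notification when it returns false.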
    • W
      sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn committed
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, .. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy buffers, ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      sock: add SOCK_ZEROCOPY sockopt · 76851d12
      Willem de Bruijn committed
      The send call ignores unknown flags. Legacy applications may already
      unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
      socket opts in to zerocopy.
      
      Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
      Processes can also query this socket option to detect kernel support
      for the feature. Older kernels will return ENOPROTOOPT.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      sock: add MSG_ZEROCOPY · 52267790
      Willem de Bruijn committed
      The kernel supports zerocopy sendmsg in virtio and tap. Expand the
      infrastructure to support other socket types. Introduce a completion
      notification channel over the socket error queue. Notifications are
      returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
      blocking the send/recv path on receiving notifications.
      
      Add reference counting, to support the skb split, merge, resize and
      clone operations possible with SOCK_STREAM and other socket types.
      
      The patch does not yet modify any datapaths.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
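A receiver of these completion notifications would typically inspect the `struct sock_extended_err` it reads from the socket error queue. The sketch below shows only that check and assumes a Linux build environment with `<linux/errqueue.h>`. `zc_parse_notification` is a hypothetical helper, and reading the inclusive range from `ee_info` and `ee_data` follows the kernel's MSG_ZEROCOPY documentation rather than anything stated in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <linux/errqueue.h>

/* Fallback for older userspace headers that predate MSG_ZEROCOPY. */
#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY 5
#endif

/* Hypothetical helper: returns true and fills [*lo, *hi] if serr is a
 * zerocopy completion notification, false for any other error report. */
static bool zc_parse_notification(const struct sock_extended_err *serr,
                                  uint32_t *lo, uint32_t *hi)
{
    assert(serr && lo && hi);

    /* Zerocopy notifications carry ee_errno == 0 and a dedicated origin. */
    if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY || serr->ee_errno != 0)
        return false;

    /* Assumption from the kernel documentation: inclusive range of
     * completed sendmsg() calls is [ee_info, ee_data]. */
    *lo = serr->ee_info;
    *hi = serr->ee_data;
    return true;
}
```

In a real program the structure would come from the control message returned by `recvmsg()` with `MSG_ERRQUEUE`; that plumbing is omitted here.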
    • W
      sock: skb_copy_ubufs support for compound pages · 3ece7826
      Willem de Bruijn committed
      Refine skb_copy_ubufs to support compound pages. With upcoming TCP
      zerocopy sendmsg, such fragments may appear.
      
      The existing code replaces each page one for one. Splitting each
      compound page into an independent number of regular pages can result
      in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.
      
      Instead, fill all destination pages but the last to PAGE_SIZE.
      Split the existing alloc + copy loop into separate stages:
      1. compute bytelength and minimum number of pages to store this.
      2. allocate
      3. copy, filling each page except the last to PAGE_SIZE bytes
      4. update skb frag array
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
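Stages 1 and 3 of the list above reduce to simple arithmetic, sketched here in user-space C. The names `pages_needed` and `bytes_in_page` and the 4096-byte `SKETCH_PAGE_SIZE` are assumptions for illustration; the kernel uses its own `PAGE_SIZE` and helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch only. */
#define SKETCH_PAGE_SIZE 4096u

/* Stage 1: minimum number of pages needed to store bytelen bytes. */
static size_t pages_needed(size_t bytelen)
{
    return (bytelen + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Stage 3: bytes stored in destination page idx (0-based) when every
 * page except the last is filled to SKETCH_PAGE_SIZE. */
static size_t bytes_in_page(size_t bytelen, size_t idx)
{
    size_t n = pages_needed(bytelen);

    assert(idx < n);
    if (idx + 1 < n)
        return SKETCH_PAGE_SIZE;          /* full page */
    return bytelen - idx * SKETCH_PAGE_SIZE; /* remainder in last page */
}
```

Packing the data densely this way means the destination fragment count depends only on the total byte length, not on how the source compound pages were aligned.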
    • W
      sock: allocate skbs from optmem · 98ba0bd5
      Willem de Bruijn committed
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      net: fib_rules: Implement notification logic in core · 1b2a4440
      Ido Schimmel committed
      Unlike the routing tables, the FIB rules share a common core, so instead
      of replicating the same logic for each address family we can simply dump
      the rules and send notifications from the core itself.
      
      To protect the integrity of the dump, a rules-specific sequence counter
      is added for each address family and incremented whenever a rule is
      added or deleted (under RTNL).
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      net: core: Make the FIB notification chain generic · 04b1d4e5
      Ido Schimmel committed
      The FIB notification chain is currently solely used by IPv4 code.
      However, we're going to introduce IPv6 FIB offload support, which
      requires these notifications as well.
      
      As explained in commit c3852ef7 ("ipv4: fib: Replay events when
      registering FIB notifier"), upon registration to the chain, the callee
      receives a full dump of the FIB tables and rules by traversing all the
      net namespaces. The integrity of the dump is ensured by a per-namespace
      sequence counter that is incremented whenever a change to the tables or
      rules occurs.
      
      In order to allow more address families to use the chain, each family is
      expected to register its fib_notifier_ops in its pernet init. These
      operations allow the common code to read the family's sequence counter
      as well as dump its tables and rules in the given net namespace.
      
      Additionally, a 'family' parameter is added to sent notifications, so
      that listeners could distinguish between the different families.
      
      Implement the common code that allows listeners to register to the chain
      and for address families to register their fib_notifier_ops. Subsequent
      patches will implement these operations in IPv6.
      
      In the future, ipmr and ip6mr will be extended to provide these
      notifications as well.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
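The role of the sequence counter in protecting a dump can be shown with a toy model: record the counter before the dump and treat the dump as valid only if the counter is unchanged afterwards. All names below (`struct sketch_netns`, `fib_change`, `dump_begin`, `dump_still_valid`) are hypothetical, and locking and memory ordering are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a net namespace with a FIB sequence counter. */
struct sketch_netns {
    unsigned int fib_seq;
};

/* Any change to tables or rules bumps the counter. */
static void fib_change(struct sketch_netns *net)
{
    assert(net);
    net->fib_seq++;
}

/* Snapshot the counter before starting a dump. */
static unsigned int dump_begin(const struct sketch_netns *net)
{
    assert(net);
    return net->fib_seq;
}

/* After the dump: true if nothing changed while dumping, so the
 * listener's view is consistent; false means the dump must be redone. */
static bool dump_still_valid(const struct sketch_netns *net,
                             unsigned int seq_at_start)
{
    assert(net);
    return net->fib_seq == seq_at_start;
}
```

A registering listener would loop: snapshot, dump, validate, and retry the dump when validation fails.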
    • W
      bpf: fix the printing of ifindex · eb48d682
      William Tu committed
      Save the ifindex before it gets zeroed so the invalid
      ifindex can be printed out.
      Signed-off-by: William Tu <u9012063@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 Aug, 2017 1 commit
  6. 02 Aug, 2017 3 commits
  7. 30 Jul, 2017 2 commits
    • V
      net: ethtool: add support for forward error correction modes · 1a5f3da2
      Vidya Sagar Ravipati committed
      Forward Error Correction (FEC) modes, i.e. Base-R and Reed-Solomon
      modes, are introduced in the 25G/40G/100G standards to provide a
      good BER at high speeds. Various networking devices that support
      25G/40G/100G provide the ability to manage supported FEC modes, and
      the lack of FEC encoding control and reporting today is a source of
      interoperability issues for many vendors.
      FEC capability, as well as a specific FEC mode, i.e. Base-R
      or RS modes, can be requested or advertised through bits D44:47 of
      the base link codeword.
      
      This patch set intends to provide an option under ethtool to manage
      and report FEC encoding settings for networking devices as per the
      IEEE 802.3 bj, bm and by specs.
      
      set-fec/show-fec option(s) are designed to provide control and
      report the FEC encoding on the link.
      
      SET FEC option:
      root@tor: ethtool --set-fec  swp1 encoding [off | RS | BaseR | auto]
      
      Encoding: Types of encoding
      Off    :  Turning off any encoding
      RS     :  enforcing RS-FEC encoding on supported speeds
      BaseR  :  enforcing Base R encoding on supported speeds
      Auto   :  IEEE defaults for the speed/medium combination
      
      Here are a few examples of what we would expect if encoding=auto:
      - if autoneg is on, we are expecting FEC to be negotiated as on or off
        as long as the protocol supports it
      - if the hardware is capable of detecting the FEC encoding on its
        receiver it will reconfigure its encoder to match
      - in absence of the above, the configuration would be set to IEEE
        defaults.
      
      From our understanding, this is essentially what most hardware/driver
      combinations are doing today in the absence of a way for users to
      control the behavior.
      
      SHOW FEC option:
      root@tor: ethtool --show-fec  swp1
      FEC parameters for swp1:
      Active FEC encodings: RS
      Configured FEC encodings:  RS | BaseR
      
      ETHTOOL DEVNAME output modification:
      
      ethtool devname output:
      root@tor:~# ethtool swp1
      Settings for swp1:
      root@hpe-7712-03:~# ethtool swp18
      Settings for swp18:
          Supported ports: [ FIBRE ]
          Supported link modes:   40000baseCR4/Full
                                  40000baseSR4/Full
                                  40000baseLR4/Full
                                  100000baseSR4/Full
                                  100000baseCR4/Full
                                  100000baseLR4_ER4/Full
          Supported pause frame use: No
          Supports auto-negotiation: Yes
          Supported FEC modes: [RS | BaseR | None | Not reported]
          Advertised link modes:  Not reported
          Advertised pause frame use: No
          Advertised auto-negotiation: No
          Advertised FEC modes: [RS | BaseR | None | Not reported]
      <<<< One or more FEC modes
          Speed: 100000Mb/s
          Duplex: Full
          Port: FIBRE
          PHYAD: 106
          Transceiver: internal
          Auto-negotiation: off
          Link detected: yes
      
      This patch includes the following changes:
      a) New ETHTOOL_GFECPARAM/ETHTOOL_SFECPARAM API, handled by
        the new get_fecparam/set_fecparam callbacks, provides support
        for configuration of forward error correction modes.
      b) Link mode bits for FEC modes, i.e. None (No FEC mode), RS, BaseR/FC,
        are defined so that users can configure these FEC modes for the
        supported and advertising fields as part of link autonegotiation.
      Signed-off-by: Vidya Sagar Ravipati <vidya.chowdary@gmail.com>
      Signed-off-by: Dustin Byford <dustin@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      net: check dev->addr_len for dev_set_mac_address() · 0254e0c6
      WANG Cong committed
      Historically, dev_ifsioc() uses struct sockaddr as the MAC
      address definition, which is why dev_set_mac_address()
      accepts a struct sockaddr pointer as input. But now we
      have various types of MAC addresses whose lengths
      are up to MAX_ADDR_LEN, longer than struct sockaddr,
      and saved in dev->addr_len.
      
      It is too late to fix dev_ifsioc() due to API
      compatibility, so just reject those larger than
      sizeof(struct sockaddr), otherwise we would read
      and use some random bytes from kernel stack.
      
      Fortunately, only a few IPv6 tunnel devices have addr_len
      larger than sizeof(struct sockaddr) and they don't support
      ndo_set_mac_addr(). But with the team driver, in lb mode, they
      can still be enslaved to a team master and make its MAC address
      length the same as theirs.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
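The length check described in this commit amounts to comparing the device's address length against the size of the structure the caller actually supplied. The following user-space sketch illustrates that comparison only; `struct sketch_dev` and `check_mac_len` are hypothetical and not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the relevant part of a network device. */
struct sketch_dev {
    unsigned char addr_len; /* hardware address length in bytes */
};

/* Reject devices whose address would not fit in the struct sockaddr
 * passed by the caller; copying addr_len bytes from it would otherwise
 * read past the end of the buffer. */
static int check_mac_len(const struct sketch_dev *dev)
{
    assert(dev);

    if (dev->addr_len > sizeof(struct sockaddr))
        return -EINVAL;
    return 0;
}
```

With this check in place, an Ethernet-style 6-byte address passes, while a device with a 32-byte address is refused before any bytes are read.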
  8. 27 Jul, 2017 1 commit
  9. 25 Jul, 2017 3 commits
  10. 21 Jul, 2017 1 commit
  11. 20 Jul, 2017 4 commits
  12. 19 Jul, 2017 1 commit
  13. 18 Jul, 2017 2 commits