1. 26 1月, 2017 15 次提交
    • W
      net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau 提交于
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3979ad7e
    • D
      Merge branch 'tcp-fastopen-new-API' · eb92f76e
      David S. Miller 提交于
      Wei Wang says:
      
      ====================
      net/tcp-fastopen: Add new userspace API support
      
      The patch series is to add support for new userspace API for TCP fastopen
      sockets.
      In the current code, user has to call sendto()/sendmsg() with special flag
      MSG_FASTOPEN for TCP fastopen sockets. This API is quite different from the
      normal TCP socket API and can be cumbersome for applications to make use
      fastopen sockets.
      So this new patch introduces a new way of using TCP fastopen sockets which
      is similar to normal TCP sockets with a new sockopt TCP_FASTOPEN_CONNECT.
      More details about it is described in the third patch.
      (First 2 patches are preparations for the third patch.)
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb92f76e
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • W
      net: Remove __sk_dst_reset() in tcp_v6_connect() · 25776aa9
      Wei Wang 提交于
      Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
      will eventually get called when sk is released. No need to handle it in
      the protocol specific connect call.
      This is also to make the code path consistent with ipv4.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25776aa9
    • W
      net/tcp-fastopen: refactor cookie check logic · 065263f4
      Wei Wang 提交于
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called else where in later changes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065263f4
    • H
      r8152: fix the wrong spelling · a9c54ad2
      hayeswang 提交于
      Replace rumtime with runtime.
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9c54ad2
    • A
      Doc: DT: bindings: net: dsa: marvell.txt: Tabification · d2345599
      Andrew Lunn 提交于
      Replace spaces with tabs. Fix indentation to be multiples of tabs, not
      a mixture or tabs and spaces.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2345599
    • D
      Merge branch 'bpf-tracepoints' · cca316f3
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      BPF tracepoints
      
      This set adds tracepoints to BPF for better introspection and
      debugging. The first two patches are prerequisite for the actual
      third patch that adds the tracepoints. I think the first two are
      small and straight forward enough that they could ideally go via
      net-next, but I'm also open to other suggestions on how to route
      them in case that's not applicable (it would reduce potential
      merge conflicts on BPF side, though). For details, please see
      individual patches.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cca316f3
    • D
      bpf: add initial bpf tracepoints · a67edbf4
      Daniel Borkmann 提交于
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For bpf(2) syscall, tracepoints have been placed for main commands
      when they succeed. In XDP case, tracepoint is for exceptions, that
      is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
      return code, or when error occurs during XDP_TX action and the packet
      could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get into our way in
      future, they can also removed [1]. Of course, these tracepoints (like
      any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a67edbf4
    • D
      lib, traceevent: add PRINT_HEX_STR variant · 0fe05591
      Daniel Borkmann 提交于
      Add support for the __print_hex_str() macro that was added for
      tracing, so that user space tools such as perf can understand
      it as well.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fe05591
    • D
      trace: add variant without spacing in trace_print_hex_seq · 2acae0d5
      Daniel Borkmann 提交于
      For upcoming tracepoint support for BPF, we want to dump the program's
      tag. Format should be similar to __print_hex(), but without spacing.
      Add a __print_hex_str() variant for exactly that purpose that reuses
      trace_print_hex_seq().
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2acae0d5
    • E
      tcp: reduce skb overhead in selected places · 60b1af33
      Eric Dumazet 提交于
      tcp_add_backlog() can use skb_condense() helper to get better
      gains and less SKB_TRUESIZE() magic. This only happens when socket
      backlog has to be used.
      
      Some attacks involve specially crafted out of order tiny TCP packets,
      clogging the ofo queue of (many) sockets.
      Then later, expensive collapse happens, trying to copy all these skbs
      into single ones.
      This unfortunately does not work if each skb has no neighbor in TCP
      sequence order.
      
      By using skb_condense() if the skb could not be coalesced to a prior
      one, we defeat these kind of threats, potentially saving 4K per skb
      (or more, since this is one page fragment).
      
      A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
      in skb->head, meaning the copy done by skb_condense() is limited to
      about 200 bytes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60b1af33
    • D
      Merge tag 'mlx5-updates-2017-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 716dcaeb
      David S. Miller 提交于
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2017-24-01
      
      The first seven patches from Or Gerlitz in this series further enhances
      the mlx5 SRIOV switchdev mode to support offloading IPv6 tunnels using the
      TC tunnel key set (encap) and unset (decap) actions.
      
      Or Gerlitz says:
      ========================
      As part of doing this change, few cleanups are done in the IPv4 code,
      later we move to use the full tunnel key info provided to the driver as
      the key for our internal hashing which is used to identify cases where
      the same tunnel is used for encapsulating multiple flows. As done in the
      IPv4 case, the control path for offloading IPv6 tunnels uses route/neigh
      lookups and construction of the IPv6 tunnel headers on the encap path and
      matching on the outer hears in the decap path.
      
      The last patch of the series enlarges the HW FDB size for the switchdev mode,
      so it has now room to contain offloaded flows as many as min(max number
      of HW flow counters supported, max HW table size supported).
      ========================
      
      Next to Or's series you can find several patches handling several topics.
      
      From Mohamad, add support for SRIOV VF min rate guarantee by using the
      TSAR BW share weights mechanism.
      
      From Or, Two patches to enable Eth VFs to query their min-inline value for
      user-space.
      for that we move a mlx5 low level min inline helper function from mlx5
      ethernet driver into the core driver and then use it in mlx5_ib to expose
      the inline mode to rdma applications through libmlx5.
      
      From Kamal Heib, Reduce memory consumption on kdump kernel.
      
      From Shaker Daibes, code reuse in CQE compression control logic
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      716dcaeb
    • D
      tipc: uninitialized return code in tipc_setsockopt() · a08ef476
      Dan Carpenter 提交于
      We shuffled some code around and added some new case statements here and
      now "res" isn't initialized on all paths.
      
      Fixes: 01fd12bb ("tipc: make replicast a user selectable option")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a08ef476
    • J
      net sched actions: Add support for user cookies · 1045ba77
      Jamal Hadi Salim 提交于
      Introduce optional 128-bit action cookie.
      Like all other cookie schemes in the networking world (eg in protocols
      like http or existing kernel fib protocol field, etc) the idea is to save
      user state that when retrieved serves as a correlator. The kernel
      _should not_ intepret it.  The user can store whatever they wish in the
      128 bits.
      
      Sample exercise(showing variable length use of cookie)
      
      .. create an accept action with cookie a1b2c3d4
      sudo $TC actions add action ok index 1 cookie a1b2c3d4
      
      .. dump all gact actions..
      sudo $TC -s actions ls action gact
      
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 1 bind 0 installed 5 sec used 5 sec
          Action statistics:
          Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
          backlog 0b 0p requeues 0
          cookie a1b2c3d4
      
      .. bind the accept action to a filter..
      sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
      u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
      
      ... send some traffic..
      $ ping 127.0.0.1 -c 3
      PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
      64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
      64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
      64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1045ba77
  2. 25 1月, 2017 25 次提交