1. 07 Feb, 2017 1 commit
  2. 05 Feb, 2017 2 commits
    • net: ipv6: Change notifications for multipath add to RTA_MULTIPATH · 3b1137fe
      David Ahern authored
      Change ip6_route_multipath_add to send one notification with the full
      route encoded with RTA_MULTIPATH instead of a series of individual routes.
      This is done by adding a skip_notify flag to the nl_info struct. The
      flag is used to skip sending of the notification in the fib code that
      actually inserts the route. Once the full route has been added, a
      notification is generated with all nexthops.
      
      ip6_route_multipath_add handles 3 use cases: new routes, route replace,
      and route append. The multipath notification generated needs to be
      consistent with the order of the nexthops and it should be consistent
      with the order in a FIB dump which means the route with the first nexthop
      needs to be used as the route reference. For the first 2 cases (new and
      replace), a reference to the route used to send the notification is
      obtained by saving the first route added. For the append case, the last
      route added is used to loop back to its first sibling route which is
      the first nexthop in the multipath route.
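
      The skip_notify idea above can be illustrated with a toy model. This is
      only a sketch: the table layout and the function names insert_route and
      multipath_add are hypothetical stand-ins, not the kernel code, and the
      replace case is not modelled.

```python
def insert_route(table, notifications, prefix, nexthop, skip_notify=False):
    """Insert a single-nexthop route; notify immediately unless told to skip."""
    table.setdefault(prefix, []).append(nexthop)
    if not skip_notify:
        notifications.append((prefix, [nexthop]))


def multipath_add(table, notifications, prefix, nexthops):
    """Add several nexthops, then send one notification for the whole route."""
    for nexthop in nexthops:
        # Suppress the per-route notification while the nexthops go in.
        insert_route(table, notifications, prefix, nexthop, skip_notify=True)
    # Report every nexthop in table order, starting from the first sibling,
    # so the notification matches what a dump of the table would show.
    notifications.append((prefix, list(table.get(prefix, []))))
```

      In the append case the single notification still starts with the first
      nexthop that was already in the table, which mirrors the "loop back to
      its first sibling" step described above.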
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Allow shorthand delete of all nexthops in multipath route · 0ae81335
      David Ahern authored
      IPv4 allows multipath routes to be deleted using just the prefix and
      length. For example:
          $ ip ro ls vrf red
          unreachable default metric 8192
          1.1.1.0/24
              nexthop via 10.100.1.254  dev eth1 weight 1
              nexthop via 10.11.200.2  dev eth11.200 weight 1
          10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
          10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
      
          $ ip ro del 1.1.1.0/24 vrf red
      
          $ ip ro ls vrf red
          unreachable default metric 8192
          10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
          10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
      
      The same notation does not work with IPv6 because of how multipath routes
      are implemented for IPv6. For IPv6 only the first nexthop of a multipath
      route is deleted if the request contains only a prefix and length. This
      leads to unnecessary complexity in userspace dealing with IPv6 multipath
      routes.
      
      This patch allows all nexthops to be deleted without specifying each one
      in the delete request. Internally, this is done by walking the sibling
      list of the route matching the specifications given (prefix, length,
      metric, protocol, etc).
      
          $  ip -6 ro ls vrf red
          2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
          2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
          2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024  pref medium
          2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024  pref medium
          ...
      
          $ ip -6 ro del vrf red 2001:db8:200::/120
      
          $ ip -6 ro ls vrf red
          2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
          2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
          ...
      
      Because IPv6 allows individual nexthops to be deleted without deleting
      the entire route, the multipath code path (ip6_route_multipath_del) and
      the non-multipath code path (ip6_route_del) have to be distinguished so
      that all nexthops are deleted only in the latter case. This is done by
      making the existing fc_type in fib6_config a u16 and then adding a new
      u16 field with
      fc_delete_all_nh as the first bit.
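
      The delete behaviour described above can be modelled roughly as follows.
      This is a sketch under assumptions: the dict-based table and the
      route_del helper are hypothetical, and only the delete_all_nh parameter
      echoes the fc_delete_all_nh flag named in the commit message.

```python
def route_del(table, prefix, gateway=None, delete_all_nh=False):
    """Delete nexthops of `prefix` and return how many were removed.

    - gateway given: remove only that nexthop (per-nexthop delete).
    - no gateway, delete_all_nh set: remove every sibling nexthop.
    - no gateway, flag clear: remove only the first nexthop (old behaviour).
    """
    nexthops = table.get(prefix, [])
    if not nexthops:
        return 0
    if gateway is not None:
        doomed = [nh for nh in nexthops if nh == gateway]
    elif delete_all_nh:
        doomed = list(nexthops)
    else:
        doomed = nexthops[:1]
    remaining = [nh for nh in nexthops if nh not in doomed]
    if remaining:
        table[prefix] = remaining
    else:
        del table[prefix]
    return len(doomed)
```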
      Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Feb, 2017 5 commits
  4. 02 Feb, 2017 8 commits
  5. 01 Feb, 2017 1 commit
    • ipv6: fix flow labels when the traffic class is non-0 · 90427ef5
      Dimitris Michailidis authored
      ip6_make_flowlabel() determines the flow label for IPv6 packets. It's
      supposed to be passed a flow label, which it returns as is if non-0 and
      in some other cases, otherwise it calculates a new value.
      
      The problem is callers often pass a flowi6.flowlabel, which may also
      contain traffic class bits. If the traffic class is non-0,
      ip6_make_flowlabel() mistakes the non-0 value it gets for a flow label
      and returns the whole thing. Thus it can return a 'flow label' longer
      than 20 bits, and the low 20 bits of that are typically 0, resulting in
      packets with a 0 label. Moreover, different packets of a flow may be
      labeled differently. For a TCP flow with ECN, non-payload and payload
      packets get different labels, as exemplified by this pair of
      consecutive packets:
      
      (pure ACK)
      Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
          0110 .... = Version: 6
          .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
              .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
              .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
          .... .... .... 0001 1100 1110 0100 1001 = Flow Label: 0x1ce49
          Payload Length: 32
          Next Header: TCP (6)
      
      (payload)
      Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
          0110 .... = Version: 6
          .... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0))
              .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
              .... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2)
          .... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0x00000
          Payload Length: 688
          Next Header: TCP (6)
      
      This patch allows ip6_make_flowlabel() to be passed more than just a
      flow label and has it extract the part it really wants. This was simpler
      than modifying the callers. With this patch packets like the above become
      
      Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
          0110 .... = Version: 6
          .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
              .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
              .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
          .... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e
          Payload Length: 32
          Next Header: TCP (6)
      
      Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
          0110 .... = Version: 6
          .... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0))
              .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
              .... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2)
          .... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e
          Payload Length: 688
          Next Header: TCP (6)
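
      The extraction step described above can be sketched in a few lines. This
      is an illustrative model, not the kernel function: values are treated as
      host-order integers, make_flowlabel and its auto_hash parameter (a
      stand-in for the label the kernel would compute) are hypothetical, and
      the "other cases" mentioned in the commit message are left out.

```python
FLOWLABEL_MASK = 0x000FFFFF  # low 20 bits of the version/tclass/label word
TCLASS_SHIFT = 20            # traffic class sits just above the flow label


def make_flowlabel(flowinfo, auto_hash):
    """Return the 20-bit flow label for a packet.

    `flowinfo` may carry traffic class bits above the label, so mask first
    and only then decide whether a label was supplied by the caller.
    """
    label = flowinfo & FLOWLABEL_MASK
    if label:
        return label
    return auto_hash & FLOWLABEL_MASK
```

      With a traffic class of 0x02 and no label, the input 0x02 << TCLASS_SHIFT
      is non-zero as a whole, but its masked label is 0, so the computed value
      is used for pure ACKs and payload packets alike.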
      Signed-off-by: Dimitris Michailidis <dmichail@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 31 Jan, 2017 4 commits
  7. 30 Jan, 2017 4 commits
  8. 28 Jan, 2017 1 commit
    • net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Eric Dumazet authored
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, the issue came from IPv6 reassembly coping with malicious
      datagrams, which forced various pskb_may_pull() calls to reallocate a
      bigger skb->head than the one allocated by the NIC driver before
      entering the GRO layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not
      easy, as some producers might track skb->truesize, for example
      in the xmit path for back pressure feedback (sk->sk_wmem_alloc).

      We can detect the cases where it should be safe to change
      skb->truesize:
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, destructor is sock_edemux()
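
      The two conditions above amount to a small guard around the size
      adjustment. The following toy model is a sketch only: the Skb class, its
      head_size field and expand_head are hypothetical, and sock_edemux here
      is an empty stand-in for the kernel destructor of the same name.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def sock_edemux(skb):
    """Stand-in for the early-demux destructor; does nothing in this model."""


@dataclass
class Skb:
    truesize: int
    head_size: int
    sk: Optional[object] = None
    destructor: Optional[Callable] = None


def expand_head(skb, new_head_size):
    """Grow the head and fix up truesize only when that is safe."""
    old_head_size = skb.head_size
    skb.head_size = new_head_size
    # Safe cases: no socket attached, or the destructor is sock_edemux,
    # which does not account memory against the socket.
    if skb.sk is None or skb.destructor is sock_edemux:
        skb.truesize += new_head_size - old_head_size
```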
      
      My audit found only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove the skb parameter from the sock_edemux macro when
      CONFIG_INET is not set, to avoid a compile error.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Slava Shwartsman <slavash@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 27 Jan, 2017 3 commits
  10. 26 Jan, 2017 5 commits
    • net: dsa: Mop up remaining NET_DSA_HWMON references · 43450293
      Andrew Lunn authored
      Previous patches have moved the temperature sensor code into the
      Marvell PHYs. A few now dead references to NET_DSA_HWMON were left
      behind. Go reap them.
      Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau authored
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
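
      The distinction described above can be pictured as a small decision
      function. This is a sketch, not the kernel code: the function name and
      the defer_connect argument (meaning "a TFO connect is pending") are
      hypothetical, and the EALREADY branch for an ordinary connect in
      progress is an assumption that does not come from this commit message.

```python
import errno


def connecting_state_error(defer_connect, is_sendmsg):
    """Pick the errno reported while the socket is still connecting."""
    if defer_connect:
        # sendmsg() keeps seeing EINPROGRESS; connect() now sees EISCONN,
        # matching what a repeated connect() returns without TFO.
        return errno.EINPROGRESS if is_sendmsg else errno.EISCONN
    return errno.EALREADY
```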
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang authored
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences is not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created, without changing the rest of the socket call
      sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
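
      From user space, the single extra setsockopt() call might look like the
      sketch below. It assumes Linux; the fallback value 30 is the option
      number from <linux/tcp.h> for Python builds whose socket module does not
      export the constant, and make_tfo_client_socket is a hypothetical helper
      name.

```python
import socket

# Use the constant from the socket module when it is exported; otherwise
# fall back to the Linux option number (assumption: running on Linux).
TCP_FASTOPEN_CONNECT = getattr(socket, "TCP_FASTOPEN_CONNECT", 30)


def make_tfo_client_socket():
    """Create a TCP socket and try to enable Fast Open on connect().

    Returns (sock, enabled). When the kernel rejects the option the socket
    is still usable and simply performs a normal handshake.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN_CONNECT, 1)
        enabled = True
    except OSError:
        enabled = False
    return sock, enabled
```

      The caller then uses connect() followed by send() or sendall() exactly
      as before; no MSG_FASTOPEN flag is involved.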
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tcp-fastopen: refactor cookie check logic · 065263f4
      Wei Wang authored
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called elsewhere in later changes.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net sched actions: Add support for user cookies · 1045ba77
      Jamal Hadi Salim authored
      Introduce optional 128-bit action cookie.
      Like all other cookie schemes in the networking world (e.g. in protocols
      like HTTP or the existing kernel fib protocol field, etc.) the idea is to
      save user state that when retrieved serves as a correlator. The kernel
      _should not_ interpret it.  The user can store whatever they wish in the
      128 bits.
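
      A user-space tool that accepts such a cookie as a hex string would need
      to bound its length. The helper below is a hypothetical sketch of that
      check, not the parser used by tc.

```python
MAX_COOKIE_BYTES = 16  # 128 bits


def parse_cookie(hex_string):
    """Convert a hex string into an opaque cookie of at most 16 bytes."""
    raw = bytes.fromhex(hex_string)  # raises ValueError on malformed input
    if len(raw) > MAX_COOKIE_BYTES:
        raise ValueError("cookie longer than 128 bits")
    return raw
```

      Shorter cookies such as a1b2c3d4 are accepted as is, which matches the
      variable-length use shown in the sample exercise.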
      
      Sample exercise (showing variable length use of cookie)
      
      .. create an accept action with cookie a1b2c3d4
      sudo $TC actions add action ok index 1 cookie a1b2c3d4
      
      .. dump all gact actions..
      sudo $TC -s actions ls action gact
      
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 1 bind 0 installed 5 sec used 5 sec
          Action statistics:
          Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
          backlog 0b 0p requeues 0
          cookie a1b2c3d4
      
      .. bind the accept action to a filter..
      sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
      u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
      
      ... send some traffic..
      $ ping 127.0.0.1 -c 3
      PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
      64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
      64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
      64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 25 Jan, 2017 5 commits
    • net: Specify the owning module for lwtunnel ops · 88ff7334
      Robert Shearman authored
      Modules implementing lwtunnel ops should not be allowed to unload
      while there is state alive using those ops, so specify the owning
      module for all lwtunnel ops.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_tables: deconstify walk callback function · de70185d
      Pablo Neira Ayuso authored
      The flush operation needs to modify set and element objects, so let's
      deconstify this.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net/sched: Introduce sample tc action · 5c5670fa
      Yotam Gigi authored
      This action allows the user to sample traffic matched by a tc classifier.
      The sampling consists of choosing packets randomly and sampling them using
      the psample module. The user can configure the psample group number, the
      sampling rate and the packet's truncation (to save kernel-user traffic).
      
      Example:
      To sample ingress traffic from interface eth1, one may use the commands:
      
      tc qdisc add dev eth1 handle ffff: ingress
      
      tc filter add dev eth1 parent ffff: \
      	   matchall action sample rate 12 group 4
      
      Where the first command adds an ingress qdisc and the second starts
      sampling randomly with an average of one sampled packet per 12 packets on
      dev eth1 to psample group 4.
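
      The "one sampled packet per 12 packets on average" behaviour can be
      sketched as a random draw per packet. This is an illustration only:
      should_sample and truncate are hypothetical helpers, and Python's random
      module stands in for whatever random source the kernel uses.

```python
import random


def should_sample(rate, rng=random):
    """Return True for roughly one packet in `rate` (rate must be >= 1)."""
    return rng.randrange(rate) == 0


def truncate(packet, trunc_size=None):
    """Keep only the first `trunc_size` bytes when truncation is configured."""
    return packet if trunc_size is None else packet[:trunc_size]
```

      With rate 12, as in the tc command above, each matched packet has a 1 in
      12 chance of being handed to the sampling group.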
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce psample, a new genetlink channel for packet sampling · 6ae0a628
      Yotam Gigi authored
      Add a general way for kernel modules to sample packets, without being tied
      to any specific subsystem. This netlink channel can be used by tc,
      iptables, etc. and allows packet sampling in the kernel to be standardized.
      
      For every sampled packet, the psample module adds the following metadata
      fields:
      
      PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable
      
      PSAMPLE_ATTR_OIFINDEX - the packet's output ifindex, if applicable
      
      PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
         truncated during sampling
      
      PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
         user who initiated the sampling. This field allows the user to
         differentiate between several samplers working simultaneously and
         filter packets relevant to them
      
      PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
         sequence is kept for each group
      
      PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
      
      PSAMPLE_ATTR_DATA - the actual packet bits
      
      The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
      group. In addition, add the GET_GROUPS netlink command which allows the
      user to see the current sample groups, their refcount and sequence number.
      This command currently supports only netlink dump mode.
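
      The metadata listed above can be pictured as one record per sampled
      packet. The model below is a sketch under assumptions: the Sample and
      Sampler classes are hypothetical, the field names merely echo the
      PSAMPLE_ATTR_* names, nothing is encoded as netlink attributes, and
      origsize is always filled in for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    iifindex: int
    oifindex: int
    origsize: int      # size before truncation
    sample_group: int
    group_seq: int     # per-group sequence counter
    sample_rate: int
    data: bytes        # possibly truncated packet bytes


class Sampler:
    def __init__(self):
        self._seq = {}  # last sequence number handed out, keyed by group

    def sample(self, packet, group, rate, iifindex=0, oifindex=0,
               trunc_size=None):
        seq = self._seq.get(group, 0) + 1
        self._seq[group] = seq
        data = packet if trunc_size is None else packet[:trunc_size]
        return Sample(iifindex, oifindex, len(packet), group, seq, rate, data)
```

      A listener interested in a single group can filter on sample_group and
      compare origsize with len(data) to see whether a packet was truncated.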
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen authored
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
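
      A user-space view of this sysctl might look like the sketch below. The
      /proc path follows from the sysctl name; the helper names, the fallback
      of 1024 for kernels without the sysctl, and the simplified overlap test
      are assumptions made for illustration.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")
DEFAULT_START = 1024  # traditional first unprivileged port


def unprivileged_port_start(path=SYSCTL_PATH):
    """Read the first unprivileged port, falling back to the classic value."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return DEFAULT_START


def is_privileged(port, start):
    """Ports below `start` are privileged; start == 0 disables them all."""
    return port < start


def overlaps_local_range(start, local_low):
    """The privileged range [0, start) overlaps a local range starting at
    `local_low` exactly when start is greater than local_low."""
    return start > local_low
```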
      
      The use case for this change is to allow containerized processes to bind
      to privileged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's privileged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 21 Jan, 2017 1 commit