1. 21 3月, 2016 2 次提交
    • D
      ipv6, trace: fix tos reporting on fib6_table_lookup · 69716a2b
      Daniel Borkmann 提交于
      flowi6_tos of struct flowi6 is unused in IPv6, therefore dumping tos on
      that tracepoint will also give incorrect information wrt traffic class.
      
      If we want to fix it, we need to extract it via ip6_tclass(flp->flowlabel).
      While for the same test case I get a count of 0 non-zero tos values before
      the change, they now start to show up after the change:
      
        # ./perf record -e fib6:fib6_table_lookup -a sleep 10
        # ./perf script | grep -v "tos 0" | wc -l
        60
      
      Since there's no user in the kernel tree anymore of flowi6_tos, remove the
      define to avoid any future confusion on this.
      
      Fixes: b811580d ("net: IPv6 fib lookup tracepoint")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69716a2b
    • D
      vxlan: fix populating tclass in vxlan6_get_route · eaa93bf4
      Daniel Borkmann 提交于
      Jiri mentioned that flowi6_tos of struct flowi6 is never used/read
      anywhere. In fact, rest of the kernel uses the flowi6's flowlabel,
      where the traffic class _and_ the flowlabel (aka flowinfo) is encoded.
      
      For example, for policy routing, fib6_rule_match() uses ip6_tclass()
      that is applied on the flowlabel member for matching on tclass. Similar
      fix is needed for geneve, where flowi6_tos is set as well. Installing
      a v6 blackhole rule that f.e. matches on tos is now working with vxlan.
      
      Fixes: 1400615d ("vxlan: allow setting ipv6 traffic class")
      Reported-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaa93bf4
  2. 19 3月, 2016 3 次提交
  3. 15 3月, 2016 4 次提交
  4. 14 3月, 2016 3 次提交
    • A
      ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short · 1e940829
      Alexander Duyck 提交于
      This patch updates csum_ipv6_magic so that it correctly recognizes that
      protocol is a unsigned 8 bit value.
      
      This will allow us to better understand what limitations may or may not be
      present in how we handle the data.  For example there are a number of
      places that call htonl on the protocol value.  This is likely not necessary
      and can be replaced with a multiplication by ntohl(1) which will be
      converted to a shift by the compiler.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e940829
    • M
      sctp: allow sctp_transmit_packet and others to use gfp · cea8768f
      Marcelo Ricardo Leitner 提交于
      Currently sctp_sendmsg() triggers some calls that will allocate memory
      with GFP_ATOMIC even when not necessary. In the case of
      sctp_packet_transmit it will allocate a linear skb that will be used to
      construct the packet and this may cause sends to fail due to ENOMEM more
      often than anticipated specially with big MTUs.
      
      This patch thus allows it to inherit gfp flags from upper calls so that
      it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
      similar. All others, like retransmits or flushes started from BH, are
      still allocated using GFP_ATOMIC.
      
      In netperf tests this didn't result in any performance drawbacks when
      memory is not too fragmented and made it trigger ENOMEM way less often.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cea8768f
    • A
      csum: Update csum_block_add to use rotate instead of byteswap · 33803963
      Alexander Duyck 提交于
      The code for csum_block_add was doing a funky byteswap to swap the even and
      odd bytes of the checksum if the offset was odd.  Instead of doing this we
      can save ourselves some trouble and just shift by 8 as this should have the
      same effect in terms of the final checksum value and only requires one
      instruction.
      
      In addition we can update csum_block_sub to just use csum_block_add with a
      inverse value for csum2.  This way we follow the same code path as
      csum_block_add without having to duplicate it.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33803963
  5. 12 3月, 2016 3 次提交
  6. 11 3月, 2016 7 次提交
  7. 10 3月, 2016 5 次提交
    • T
      kcm: Add receive message timeout · 29152a34
      Tom Herbert 提交于
      This patch adds receive timeout for message assembly on the attached TCP
      sockets. The timeout is set when a new messages is started and the whole
      message has not been received by TCP (not in the receive queue). If the
      completely message is subsequently received the timer is cancelled, if the
      timer expires the RX side is aborted.
      
      The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
      set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
      to KCM.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29152a34
    • T
      kcm: Add memory limit for receive message construction · 7ced95ef
      Tom Herbert 提交于
      Message assembly is performed on the TCP socket. This is logically
      equivalent of an application that performs a peek on the socket to find
      out how much memory is needed for a receive buffer. The receive socket
      buffer also provides the maximum message size which is checked.
      
      The receive algorithm is something like:
      
         1) Receive the first skbuf for a message (or skbufs if multiple are
            needed to determine message length).
         2) Check the message length against the number of bytes in the TCP
            receive queue (tcp_inq()).
      	- If all the bytes of the message are in the queue (incluing the
      	  skbuf received), then proceed with message assembly (it should
      	  complete with the tcp_read_sock)
              - Else, mark the psock with the number of bytes needed to
      	  complete the message.
         3) In TCP data ready function, if the psock indicates that we are
            waiting for the rest of the bytes of a messages, check the number
            of queued bytes against that.
              - If there are still not enough bytes for the message, just
      	  return
              - Else, clear the waiting bytes and proceed to receive the
      	  skbufs.  The message should now be received in one
      	  tcp_read_sock
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ced95ef
    • T
      kcm: Add statistics and proc interfaces · cd6e111b
      Tom Herbert 提交于
      This patch adds various counters for KCM. These include counters for
      messages and bytes received or sent, as well as counters for number of
      attached/unattached TCP sockets and other error or edge events.
      
      The statistics are exposed via a proc interface. /proc/net/kcm provides
      statistics per KCM socket and per psock (attached TCP sockets).
      /proc/net/kcm_stats provides aggregate statistics.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd6e111b
    • T
      kcm: Kernel Connection Multiplexor module · ab7ac4eb
      Tom Herbert 提交于
      This module implements the Kernel Connection Multiplexor.
      
      Kernel Connection Multiplexor (KCM) is a facility that provides a
      message based interface over TCP for generic application protocols.
      With KCM an application can efficiently send and receive application
      protocol messages over TCP using datagram sockets.
      
      For more information see the included Documentation/networking/kcm.txt
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab7ac4eb
    • T
      tcp: Add tcp_inq to get available receive bytes on socket · 473bd239
      Tom Herbert 提交于
      Create a common kernel function to get the number of bytes available
      on a TCP socket. This is based on code in INQ getsockopt and we now call
      the function for that getsockopt.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      473bd239
  8. 09 3月, 2016 6 次提交
  9. 07 3月, 2016 1 次提交
    • J
      ipvs: drop first packet to redirect conntrack · f719e375
      Julian Anastasov 提交于
      Jiri Bohac is reporting for a problem where the attempt
      to reschedule existing connection to another real server
      needs proper redirect for the conntrack used by the IPVS
      connection. For example, when IPVS connection is created
      to NAT-ed real server we alter the reply direction of
      conntrack. If we later decide to select different real
      server we can not alter again the conntrack. And if we
      expire the old connection, the new connection is left
      without conntrack.
      
      So, the only way to redirect both the IPVS connection and
      the Netfilter's conntrack is to drop the SYN packet that
      hits existing connection, to wait for the next jiffie
      to expire the old connection and its conntrack and to rely
      on client's retransmission to create new connection as
      usually.
      
      Jiri Bohac provided a fix that drops all SYNs on rescheduling,
      I extended his patch to do such drops only for connections
      that use conntrack. Here is the original report from Jiri Bohac:
      
      Since commit dc7b3eb9 ("ipvs: Fix reuse connection if real server
      is dead"), new connections to dead servers are redistributed
      immediately to new servers.  The old connection is expired using
      ip_vs_conn_expire_now() which sets the connection timer to expire
      immediately.
      
      However, before the timer callback, ip_vs_conn_expire(), is run
      to clean the connection's conntrack entry, the new redistributed
      connection may already be established and its conntrack removed
      instead.
      
      Fix this by dropping the first packet of the new connection
      instead, like we do when the destination server is not available.
      The timer will have deleted the old conntrack entry long before
      the first packet of the new connection is retransmitted.
      
      Fixes: dc7b3eb9 ("ipvs: Fix reuse connection if real server is dead")
      Signed-off-by: NJiri Bohac <jbohac@suse.cz>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      f719e375
  10. 04 3月, 2016 1 次提交
  11. 03 3月, 2016 1 次提交
  12. 02 3月, 2016 4 次提交
    • D
      net: ipv4: Convert IP network timestamps to be y2038 safe · 822c8685
      Deepa Dinamani 提交于
      ICMP timestamp messages and IP source route options require
      timestamps to be in milliseconds modulo 24 hours from
      midnight UT format.
      
      Add inet_current_timestamp() function to support this. The function
      returns the required timestamp in network byte order.
      
      Timestamp calculation is also changed to call ktime_get_real_ts64()
      which uses struct timespec64. struct timespec64 is y2038 safe.
      Previously it called getnstimeofday() which uses struct timespec.
      struct timespec is not y2038 safe.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      822c8685
    • J
      introduce IFE action · ef6980b6
      Jamal Hadi Salim 提交于
      This action allows for a sending side to encapsulate arbitrary metadata
      which is decapsulated by the receiving end.
      The sender runs in encoding mode and the receiver in decode mode.
      Both sender and receiver must specify the same ethertype.
      At some point we hope to have a registered ethertype and we'll
      then provide a default so the user doesnt have to specify it.
      For now we enforce the user specify it.
      
      Lets show example usage where we encode icmp from a sender towards
      a receiver with an skbmark of 17; both sender and receiver use
      ethertype of 0xdead to interop.
      
      YYYY: Lets start with Receiver-side policy config:
      xxx: add an ingress qdisc
      sudo tc qdisc add dev $ETH ingress
      
      xxx: any packets with ethertype 0xdead will be subjected to ife decoding
      xxx: we then restart the classification so we can match on icmp at prio 3
      sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
      u32 match u32 0 0 flowid 1:1 \
      action ife decode reclassify
      
      xxx: on restarting the classification from above if it was an icmp
      xxx: packet, then match it here and continue to the next rule at prio 4
      xxx: which will match based on skb mark of 17
      sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
      u32 match ip protocol 1 0xff flowid 1:1 \
      action continue
      
      xxx: match on skbmark of 0x11 (decimal 17) and accept
      sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
      handle 0x11 fw flowid 1:1 \
      action ok
      
      xxx: Lets show the decoding policy
      sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
      xxx:
      filter pref 2 u32
      filter pref 2 u32 fh 800: ht divisor 1
      filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 0 success 0)
        match 00000000/00000000 at 0 (success 0 )
              action order 1: ife decode action reclassify
               index 1 ref 1 bind 1 installed 14 sec used 14 sec
               type: 0x0
               Metadata: allow mark allow hash allow prio allow qmap
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
      xxx:
      Observe that above lists all metadatum it can decode. Typically these
      submodules will already be compiled into a monolithic kernel or
      loaded as modules
      
      YYYY: Lets show the sender side now ..
      
      xxx: Add an egress qdisc on the sender netdev
      sudo tc qdisc add dev $ETH root handle 1: prio
      xxx:
      xxx: Match all icmp packets to 192.168.122.237/24, then
      xxx: tag the packet with skb mark of decimal 17, then
      xxx: Encode it with:
      xxx:	ethertype 0xdead
      xxx:	add skb->mark to whitelist of metadatum to send
      xxx:	rewrite target dst MAC address to 02:15:15:15:15:15
      xxx:
      sudo $TC filter add dev $ETH parent 1: protocol ip prio 10  u32 \
      match ip dst 192.168.122.237/24 \
      match ip protocol 1 0xff \
      flowid 1:2 \
      action skbedit mark 17 \
      action ife encode \
      type 0xDEAD \
      allow mark \
      dst 02:15:15:15:15:15
      
      xxx: Lets show the encoding policy
      sudo tc -s filter ls dev $ETH parent 1: protocol ip
      xxx:
      filter pref 10 u32
      filter pref 10 u32 fh 800: ht divisor 1
      filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 0 success 0)
        match c0a87aed/ffffffff at 16 (success 0 )
        match 00010000/00ff0000 at 8 (success 0 )
      
      	action order 1:  skbedit mark 17
      	 index 6 ref 1 bind 1
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      	action order 2: ife encode action pipe
      	 index 3 ref 1 bind 1
      	 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
       	 Metadata: allow mark
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      xxx:
      
      test by sending ping from sender to destination
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef6980b6
    • V
      net: dsa: support VLAN filtering switchdev attr · fb2dabad
      Vivien Didelot 提交于
      When a user explicitly requests VLAN filtering with something like:
      
          # echo 1 > /sys/class/net/<bridge>/bridge/vlan_filtering
      
      Switchdev propagates a SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING port
      attribute.
      
      Add support for it in the DSA layer with a new port_vlan_filtering
      function to let drivers toggle 802.1Q filtering on user demand.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb2dabad
    • J
      Introduce devlink infrastructure · bfcd3a46
      Jiri Pirko 提交于
      Introduce devlink infrastructure for drivers to register and expose to
      userspace via generic Netlink interface.
      
      There are two basic objects defined:
      devlink - one instance for every "parent device", for example switch ASIC
      devlink port - one instance for every physical port of the device.
      
      This initial portion implements basic get/dump of objects to userspace.
      Also, port splitter and port type setting is implemented.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfcd3a46