1. 15 4月, 2016 3 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
    • M
      ipv6: udp: Do a route lookup and update during release_cb · e646b657
      Martin KaFai Lau 提交于
      This patch adds a release_cb for UDPv6.  It does a route lookup
      and updates sk->sk_dst_cache if it is needed.  It picks up the
      left-over job from ip6_sk_update_pmtu() if the sk was owned
      by user during the pmtu update.
      
      It takes a rcu_read_lock to protect the __sk_dst_get() operations
      because another thread may do ip6_dst_store() without taking the
      sk lock (e.g. sendmsg).
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e646b657
    • M
      ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update · 33c162a9
      Martin KaFai Lau 提交于
      There is a case in connected UDP socket such that
      getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
      sequence could be the following:
      1. Create a connected UDP socket
      2. Send some datagrams out
      3. Receive a ICMPV6_PKT_TOOBIG
      4. No new outgoing datagrams to trigger the sk_dst_check()
         logic to update the sk->sk_dst_cache.
      5. getsockopt(IPV6_MTU) returns the mtu from the invalid
         sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
      
      This patch updates the sk->sk_dst_cache for a connected datagram sk
      during pmtu-update code path.
      
      Note that the sk->sk_v6_daddr is used to do the route lookup
      instead of skb->data (i.e. iph).  It is because a UDP socket can become
      connected after sending out some datagrams in un-connected state.  or
      It can be connected multiple times to different destinations.  Hence,
      iph may not be related to where sk is currently connected to.
      
      It is done under '!sock_owned_by_user(sk)' condition because
      the user may make another ip6_datagram_connect()  (i.e changing
      the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
      code path.
      
      For the sock_owned_by_user(sk) == true case, the next patch will
      introduce a release_cb() which will update the sk->sk_dst_cache.
      
      Test:
      
      Server (Connected UDP Socket):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Route Details:
      [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
      2fac::/64 dev eth0  proto kernel  metric 256  pref medium
      2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium
      
      A simple python code to create a connected UDP socket:
      
      import socket
      import errno
      
      HOST = '2fac::1'
      PORT = 8080
      
      s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
      s.bind((HOST, PORT))
      s.connect(('2fac:face::face', 53))
      print("connected")
      while True:
          try:
      	data = s.recv(1024)
          except socket.error as se:
      	if se.errno == errno.EMSGSIZE:
      		pmtu = s.getsockopt(41, 24)
      		print("PMTU:%d" % pmtu)
      		break
      s.close()
      
      Python program output after getting a ICMPV6_PKT_TOOBIG:
      [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
      connected
      PMTU:1300
      
      Cache routes after recieving TOOBIG:
      [root@arch-fb-vm1 ~]# ip -6 r show table cache
      2fac:face::face via 2fac::face dev eth0  metric 0
          cache  expires 463sec mtu 1300 pref medium
      
      Client (Send the ICMPV6_PKT_TOOBIG):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      scapy is used to generate the TOOBIG message.  Here is the scapy script I have
      used:
      
      >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
      1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
      >>> sendp(p, iface='qemubr0')
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33c162a9
  2. 12 4月, 2016 1 次提交
    • D
      net: vrf: Fix dst reference counting · 9ab179d8
      David Ahern 提交于
      Vivek reported a kernel exception deleting a VRF with an active
      connection through it. The root cause is that the socket has a cached
      reference to a dst that is destroyed. Converting the dst_destroy to
      dst_release and letting proper reference counting kick in does not
      work as the dst has a reference to the device which needs to be released
      as well.
      
      I talked to Hannes about this at netdev and he pointed out the ipv4 and
      ipv6 dst handling has dst_ifdown for just this scenario. Rather than
      continuing with the reinvented dst wheel in VRF just remove it and
      leverage the ipv4 and ipv6 versions.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ab179d8
  3. 11 4月, 2016 1 次提交
    • M
      sctp: avoid refreshing heartbeat timer too often · ba6f5e33
      Marcelo Ricardo Leitner 提交于
      Currently on high rate SCTP streams the heartbeat timer refresh can
      consume quite a lot of resources as timer updates are costly and it
      contains a random factor, which a) is also costly and b) invalidates
      mod_timer() optimization for not editing a timer to the same value.
      It may even cause the timer to be slightly advanced, for no good reason.
      
      As suggested by David Laight this patch now removes this timer update
      from hot path by leaving the timer on and re-evaluating upon its
      expiration if the heartbeat is still needed or not, similarly to what is
      done for TCP. If it's not needed anymore the timer is re-scheduled to
      the new timeout, considering the time already elapsed.
      
      For this, we now record the last tx timestamp per transport, updated in
      the same spots as hb timer was restarted on tx. Also split up
      sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
      sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
      heartbeat one.
      
      On loopback with MTU of 65535 and data chunks with 1636, so that we
      have a considerable amount of chunks without stressing system calls,
      netperf -t SCTP_STREAM -l 30, perf looked like this before:
      
      Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
        Overhead  Command  Shared Object      Symbol
      +    6,15%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,43%  netperf  [kernel.vmlinux]   [k] _raw_write_unlock_irqrestore
         - _raw_write_unlock_irqrestore
            - 96,54% _raw_spin_unlock_irqrestore
               - 36,14% mod_timer
                  + 97,24% sctp_transport_reset_timers
                  + 2,76% sctp_do_sm
               + 33,65% __wake_up_sync_key
               + 28,77% sctp_ulpq_tail_event
               + 1,40% del_timer
            - 1,84% mod_timer
               + 99,03% sctp_transport_reset_timers
               + 0,97% sctp_do_sm
            + 1,50% sctp_ulpq_tail_event
      
      And after this patch, now with netperf -l 60:
      
      Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
        Overhead  Command  Shared Object      Symbol
      +    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
      +    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
         - _raw_spin_unlock_irqrestore
            + 49,89% __wake_up_sync_key
            + 45,68% sctp_ulpq_tail_event
            - 2,85% mod_timer
               + 76,51% sctp_transport_reset_t3_rtx
               + 23,49% sctp_do_sm
            + 1,55% del_timer
      +    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
      +    2,26%  netperf  [sctp]             [k] sctp_sendmsg
      
      Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
      ~3.7%.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba6f5e33
  4. 07 4月, 2016 1 次提交
  5. 06 4月, 2016 1 次提交
  6. 05 4月, 2016 1 次提交
  7. 24 3月, 2016 1 次提交
  8. 23 3月, 2016 1 次提交
  9. 22 3月, 2016 1 次提交
  10. 21 3月, 2016 5 次提交
    • J
      tunnels: Remove encapsulation offloads on decap. · a09a4c8d
      Jesse Gross 提交于
      If a packet is either locally encapsulated or processed through GRO
      it is marked with the offloads that it requires. However, when it is
      decapsulated these tunnel offload indications are not removed. This
      means that if we receive an encapsulated TCP packet, aggregate it with
      GRO, decapsulate, and retransmit the resulting frame on a NIC that does
      not support encapsulation, we won't be able to take advantage of hardware
      offloads even though it is just a simple TCP packet at this point.
      
      This fixes the problem by stripping off encapsulation offload indications
      when packets are decapsulated.
      
      The performance impacts of this bug are significant. In a test where a
      Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated,
      and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a
      result of avoiding unnecessary segmentation at the VM tap interface.
      Reported-by: NRamu Ramamurthy <sramamur@linux.vnet.ibm.com>
      Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
      Signed-off-by: NJesse Gross <jesse@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a09a4c8d
    • M
      sctp: keep fragmentation point aligned to word size · 659e0bca
      Marcelo Ricardo Leitner 提交于
      If the user supply a different fragmentation point or if there is a
      network header that cause it to not be aligned, force it to be aligned.
      
      Fragmentation point at a value that is not aligned is not optimal.  It
      causes extra padding to be used and has just no pros.
      
      v2:
       - Make use of the new WORD_TRUNC macro
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659e0bca
    • M
      sctp: align MTU to a word · 3822a5ff
      Marcelo Ricardo Leitner 提交于
      SCTP is a protocol that is aligned to a word (4 bytes). Thus using bare
      MTU can sometimes return values that are not aligned, like for loopback,
      which is 65536 but ipv4_mtu() limits that to 65535. This mis-alignment
      will cause the last non-aligned bytes to never be used and can cause
      issues with congestion control.
      
      So it's better to just consider a lower MTU and keep congestion control
      calcs saner as they are based on PMTU.
      
      Same applies to icmp frag needed messages, which is also fixed by this
      patch.
      
      One other effect of this is the inability to send MTU-sized packet
      without queueing or fragmentation and without hitting Nagle. As the
      check performed at sctp_packet_can_append_data():
      
      if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
      	/* Enough data queued to fill a packet */
      	return SCTP_XMIT_OK;
      
      with the above example of MTU, if there are no other messages queued,
      one cannot send a packet that just fits one packet (65532 bytes) and
      without causing DATA chunk fragmentation or a delay.
      
      v2:
       - Added WORD_TRUNC macro
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3822a5ff
    • D
      ipv6, trace: fix tos reporting on fib6_table_lookup · 69716a2b
      Daniel Borkmann 提交于
      flowi6_tos of struct flowi6 is unused in IPv6, therefore dumping tos on
      that tracepoint will also give incorrect information wrt traffic class.
      
      If we want to fix it, we need to extract it via ip6_tclass(flp->flowlabel).
      While for the same test case I get a count of 0 non-zero tos values before
      the change, they now start to show up after the change:
      
        # ./perf record -e fib6:fib6_table_lookup -a sleep 10
        # ./perf script | grep -v "tos 0" | wc -l
        60
      
      Since there's no user in the kernel tree anymore of flowi6_tos, remove the
      define to avoid any future confusion on this.
      
      Fixes: b811580d ("net: IPv6 fib lookup tracepoint")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69716a2b
    • D
      vxlan: fix populating tclass in vxlan6_get_route · eaa93bf4
      Daniel Borkmann 提交于
      Jiri mentioned that flowi6_tos of struct flowi6 is never used/read
      anywhere. In fact, rest of the kernel uses the flowi6's flowlabel,
      where the traffic class _and_ the flowlabel (aka flowinfo) is encoded.
      
      For example, for policy routing, fib6_rule_match() uses ip6_tclass()
      that is applied on the flowlabel member for matching on tclass. Similar
      fix is needed for geneve, where flowi6_tos is set as well. Installing
      a v6 blackhole rule that f.e. matches on tos is now working with vxlan.
      
      Fixes: 1400615d ("vxlan: allow setting ipv6 traffic class")
      Reported-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaa93bf4
  11. 19 3月, 2016 3 次提交
  12. 15 3月, 2016 4 次提交
  13. 14 3月, 2016 3 次提交
    • A
      ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short · 1e940829
      Alexander Duyck 提交于
      This patch updates csum_ipv6_magic so that it correctly recognizes that
      protocol is a unsigned 8 bit value.
      
      This will allow us to better understand what limitations may or may not be
      present in how we handle the data.  For example there are a number of
      places that call htonl on the protocol value.  This is likely not necessary
      and can be replaced with a multiplication by ntohl(1) which will be
      converted to a shift by the compiler.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e940829
    • M
      sctp: allow sctp_transmit_packet and others to use gfp · cea8768f
      Marcelo Ricardo Leitner 提交于
      Currently sctp_sendmsg() triggers some calls that will allocate memory
      with GFP_ATOMIC even when not necessary. In the case of
      sctp_packet_transmit it will allocate a linear skb that will be used to
      construct the packet and this may cause sends to fail due to ENOMEM more
      often than anticipated specially with big MTUs.
      
      This patch thus allows it to inherit gfp flags from upper calls so that
      it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
      similar. All others, like retransmits or flushes started from BH, are
      still allocated using GFP_ATOMIC.
      
      In netperf tests this didn't result in any performance drawbacks when
      memory is not too fragmented and made it trigger ENOMEM way less often.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cea8768f
    • A
      csum: Update csum_block_add to use rotate instead of byteswap · 33803963
      Alexander Duyck 提交于
      The code for csum_block_add was doing a funky byteswap to swap the even and
      odd bytes of the checksum if the offset was odd.  Instead of doing this we
      can save ourselves some trouble and just shift by 8 as this should have the
      same effect in terms of the final checksum value and only requires one
      instruction.
      
      In addition we can update csum_block_sub to just use csum_block_add with a
      inverse value for csum2.  This way we follow the same code path as
      csum_block_add without having to duplicate it.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33803963
  14. 12 3月, 2016 3 次提交
  15. 11 3月, 2016 7 次提交
  16. 10 3月, 2016 4 次提交
    • T
      kcm: Add receive message timeout · 29152a34
      Tom Herbert 提交于
      This patch adds receive timeout for message assembly on the attached TCP
      sockets. The timeout is set when a new messages is started and the whole
      message has not been received by TCP (not in the receive queue). If the
      completely message is subsequently received the timer is cancelled, if the
      timer expires the RX side is aborted.
      
      The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
      set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
      to KCM.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29152a34
    • T
      kcm: Add memory limit for receive message construction · 7ced95ef
      Tom Herbert 提交于
      Message assembly is performed on the TCP socket. This is logically
      equivalent of an application that performs a peek on the socket to find
      out how much memory is needed for a receive buffer. The receive socket
      buffer also provides the maximum message size which is checked.
      
      The receive algorithm is something like:
      
         1) Receive the first skbuf for a message (or skbufs if multiple are
            needed to determine message length).
         2) Check the message length against the number of bytes in the TCP
            receive queue (tcp_inq()).
      	- If all the bytes of the message are in the queue (incluing the
      	  skbuf received), then proceed with message assembly (it should
      	  complete with the tcp_read_sock)
              - Else, mark the psock with the number of bytes needed to
      	  complete the message.
         3) In TCP data ready function, if the psock indicates that we are
            waiting for the rest of the bytes of a messages, check the number
            of queued bytes against that.
              - If there are still not enough bytes for the message, just
      	  return
              - Else, clear the waiting bytes and proceed to receive the
      	  skbufs.  The message should now be received in one
      	  tcp_read_sock
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ced95ef
    • T
      kcm: Add statistics and proc interfaces · cd6e111b
      Tom Herbert 提交于
      This patch adds various counters for KCM. These include counters for
      messages and bytes received or sent, as well as counters for number of
      attached/unattached TCP sockets and other error or edge events.
      
      The statistics are exposed via a proc interface. /proc/net/kcm provides
      statistics per KCM socket and per psock (attached TCP sockets).
      /proc/net/kcm_stats provides aggregate statistics.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd6e111b
    • T
      kcm: Kernel Connection Multiplexor module · ab7ac4eb
      Tom Herbert 提交于
      This module implements the Kernel Connection Multiplexor.
      
      Kernel Connection Multiplexor (KCM) is a facility that provides a
      message based interface over TCP for generic application protocols.
      With KCM an application can efficiently send and receive application
      protocol messages over TCP using datagram sockets.
      
      For more information see the included Documentation/networking/kcm.txt
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab7ac4eb