1. 05 April 2016 (5 commits)
    • tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915
      Committed by Eric Dumazet
      RX packet processing holds rcu_read_lock(), so we can remove the
      pairs of rcu_read_lock()/rcu_read_unlock() in the lookup functions
      if inet_diag also takes the RCU read lock before calling them.
      
      This is needed anyway, as __inet_lookup_listener() and
      inet6_lookup_listener() will soon no longer increment the refcount
      on the found listener.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
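
      A minimal sketch of the resulting pattern (argument names assumed,
      not the exact kernel signatures): the diag code takes the RCU read
      lock itself and pins the socket with a conditional refcount grab,
      since the lookup will no longer do so.

          rcu_read_lock();
          sk = __inet_lookup_established(net, hashinfo, saddr, sport,
                                         daddr, hnum, dif);
          /* The lookup returns an unreferenced socket under RCU; keep
           * it only if it is not already being destroyed. */
          if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
                  sk = NULL;
          rcu_read_unlock();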
    • tcp/dccp: remove BH disable/enable in lookup · ee3cf32a
      Committed by Eric Dumazet
      Since Linux 2.6.29, lookups only use RCU locking, so the BH
      disable/enable pairs around them are no longer needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
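
      A sketch of the shape of the change (illustrative hunk, not the
      exact diff): the BH disable/enable pair around an RCU-only lookup
      simply goes away.

          - local_bh_disable();
            sk = __inet_lookup(net, &tcp_hashinfo, saddr, sport,
                               daddr, dport, dif);
          - local_bh_enable();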
    • udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Committed by Eric Dumazet
      Tom Herbert would like to avoid touching the UDP socket refcnt for
      encapsulated traffic. For this to happen, we need to use normal RCU
      rules, with a grace period before freeing a socket. UDP sockets are
      not short-lived in the high-usage case, so the added cost of
      call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      The same remark applies to functions used by the xt_socket and
      xt_TPROXY netfilter modules, but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx test using a plain recvfrom() loop:
      438 kpps instead of 374 kpps: a 17% increase in the peak rate.
      
      v2: Addressed Willem de Bruijn's feedback on multicast handling
       - keep the early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
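
      A minimal sketch (names and the lookup signature approximated from
      the description above) of how a caller that still needs a
      reference, such as early demux, pins a socket now that freeing is
      deferred by call_rcu() instead of SLAB_DESTROY_BY_RCU:

          rcu_read_lock();
          sk = __udp4_lib_lookup(net, saddr, sport, daddr, dport,
                                 dif, &udp_table);
          /* The socket may already be mid-destruction; keep it only if
           * its refcount is still nonzero. */
          if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
                  sk = NULL;
          rcu_read_unlock();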
    • sock: enable timestamping using control messages · c14ac945
      Committed by Soheil Hassas Yeganeh
      Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using the tsflags added to `struct sockcm_cookie` (added in the
      previous patches in this series) to set the tx_flags of the last
      skb created in a sendmsg. With this patch, the timestamp recording
      bits in the skbuff's tx_flags are overridden if SO_TIMESTAMPING is
      passed in a cmsg.
      
      Please note that this is only effective for overriding the
      timestamp recording flags. Users should enable timestamp reporting
      (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options, and then ask for SOF_TIMESTAMPING_TX_* via control
      messages per sendmsg to sample timestamps for each write.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
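
      A hedged userspace sketch of the usage described above: enable
      reporting once via setsockopt(), then request a software tx
      timestamp for one particular write via a cmsg (fd is assumed to be
      a socket set up elsewhere).

          #include <linux/net_tstamp.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <sys/uio.h>

          static ssize_t send_with_tx_tstamp(int fd, const void *buf, size_t len)
          {
                  /* Enable reporting once (typically right after socket()). */
                  __u32 on = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID;
                  setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &on, sizeof(on));

                  char control[CMSG_SPACE(sizeof(__u32))];
                  struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
                  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                        .msg_control = control,
                                        .msg_controllen = sizeof(control) };
                  memset(control, 0, sizeof(control));

                  /* Per-write recording flags, carried in the cmsg. */
                  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
                  cmsg->cmsg_level = SOL_SOCKET;
                  cmsg->cmsg_type  = SO_TIMESTAMPING;
                  cmsg->cmsg_len   = CMSG_LEN(sizeof(__u32));
                  *(__u32 *)CMSG_DATA(cmsg) = SOF_TIMESTAMPING_TX_SOFTWARE;

                  return sendmsg(fd, &msg, 0); /* only this write is sampled */
          }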
    • ipv6: process socket-level control messages in IPv6 · ad1e46a8
      Committed by Soheil Hassas Yeganeh
      Process socket-level control messages by invoking
      __sock_cmsg_send in ip6_datagram_send_ctl for control messages on
      the SOL_SOCKET layer.
      
      This makes sure whenever ip6_datagram_send_ctl is called for
      udp and raw, we also process socket-level control messages.
      
      This is a bit uglier than IPv4, since IPv6 does not have
      something like ipcm_cookie. Perhaps we can later create
      a control message cookie for IPv6?
      
      Note that this commit makes the stack interpret new control
      messages that were previously ignored; apart from that, it does not
      change the behavior of existing IPv6 control messages.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
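
      A sketch (shape assumed, following the cmsg loop style of the IPv4
      counterpart) of the hook in ip6_datagram_send_ctl():
      SOL_SOCKET-level cmsgs are diverted to the generic handler, and
      everything else continues down the IPv6-specific path.

          for_each_cmsghdr(cmsg, msg) {
                  if (cmsg->cmsg_level == SOL_SOCKET) {
                          err = __sock_cmsg_send(sk, msg, cmsg, &sockc);
                          if (err)
                                  return err;
                          continue;
                  }
                  if (cmsg->cmsg_level != SOL_IPV6)
                          continue;
                  /* ... existing IPv6 cmsg handling ... */
          }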
  2. 31 March 2016 (1 commit)
  3. 28 March 2016 (4 commits)
  4. 24 March 2016 (1 commit)
  5. 21 March 2016 (2 commits)
    • tunnels: Remove encapsulation offloads on decap. · a09a4c8d
      Committed by Jesse Gross
      If a packet is either locally encapsulated or processed through
      GRO, it is marked with the offloads that it requires. However, when
      it is
      decapsulated these tunnel offload indications are not removed. This
      means that if we receive an encapsulated TCP packet, aggregate it with
      GRO, decapsulate, and retransmit the resulting frame on a NIC that does
      not support encapsulation, we won't be able to take advantage of hardware
      offloads even though it is just a simple TCP packet at this point.
      
      This fixes the problem by stripping off encapsulation offload indications
      when packets are decapsulated.
      
      The performance impact of this bug is significant. In a test where
      a Geneve-encapsulated TCP stream is sent to a hypervisor, GRO'ed,
      decapsulated, and bridged to a VM, performance improves by 60%
      (5 Gbps -> 8 Gbps) as a result of avoiding unnecessary segmentation
      at the VM tap interface.
      Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
      Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
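
      A hedged sketch of the kind of cleanup this implies on decap (the
      helper name and exact GSO mask are assumed, not quoted from the
      patch): strip the tunnel GSO types and the encapsulation mark once
      the outer header is gone.

          static inline int iptunnel_pull_offloads(struct sk_buff *skb)
          {
                  if (skb_is_gso(skb)) {
                          int err = skb_unclone(skb, GFP_ATOMIC);

                          if (err)
                                  return err;
                          /* outer header is gone, so are its offloads */
                          skb_shinfo(skb)->gso_type &=
                                  ~(SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM |
                                    SKB_GSO_GRE | SKB_GSO_GRE_CSUM);
                  }
                  skb->encapsulation = 0;
                  return 0;
          }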
    • tunnels: Don't apply GRO to multiple layers of encapsulation. · fac8e0f5
      Committed by Jesse Gross
      When drivers express support for TSO of encapsulated packets, they
      only mean that they can do it for one layer of encapsulation.
      Supporting additional levels would mean updating, at a minimum,
      more IP length fields, and drivers are unaware of this.
      
      No encapsulation device expresses support for handling offloaded
      encapsulated packets, so we won't generate these types of frames
      in the transmit path. However, GRO doesn't have a check for
      multiple levels of encapsulation and will attempt to build them.
      
      UDP tunnel GRO actually does prevent this situation but it only
      handles multiple UDP tunnels stacked on top of each other. This
      generalizes that solution to prevent any kind of tunnel stacking
      that would cause problems.
      
      Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
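
      A sketch of the generalized guard (the per-packet mark's name is
      assumed; treat this as illustrative): each tunnel gro_receive
      checks and sets the mark, so a second encapsulation layer falls
      back to the normal receive path instead of being aggregated.

          /* in a tunnel's gro_receive handler */
          if (NAPI_GRO_CB(skb)->encap_mark)
                  goto out;       /* already inside one tunnel: no GRO */
          NAPI_GRO_CB(skb)->encap_mark = 1;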
  6. 16 March 2016 (1 commit)
    • tags: Fix DEFINE_PER_CPU expansions · 25528213
      Committed by Peter Zijlstra
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within
      reasonable distance of the 80-char limit. TJ, do you have any
      preference on how to fix the wq one, or shall we just not care that
      it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
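
      The warnings appear when the variable name sits on a different line
      from DEFINE_PER_CPU(, so the tags regex captures an empty name. An
      illustrative declaration (not one of the exact lines touched):

          /* before: name on its own line, "\1" expands to nothing */
          static DEFINE_PER_CPU(struct list_head,
                                cpu_stall_list);

          /* after: keep "DEFINE_PER_CPU(type, name" on one line */
          static DEFINE_PER_CPU(struct list_head, cpu_stall_list);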
  7. 15 March 2016 (2 commits)
    • netfilter: Allow calling into nat helper without skb_dst. · 26461905
      Committed by Jarno Rajahalme
      The NAT checksum recalculation code assumes the existence of
      skb_dst, which becomes a problem for a later patch in this series
      ("openvswitch: Interface with NAT."). Simplify this by removing the
      check on skb_dst, as the checksum will be dealt with later in the
      stack.
      Suggested-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Committed by Martin KaFai Lau
      Per RFC 4898, these counters count segments sent/received that
      contain a positive-length data segment (including retransmitted
      segments carrying data). Unlike tcpi_segs_out/in,
      tcpi_data_segs_out/in excludes segments carrying no data (e.g. a
      pure ACK).
      
      The patch also updates segs_in in tcp_fastopen_add_skb() so that
      the segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the rxmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
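
      A sketch of what a tcp_segs_in() helper along these lines could
      look like (reconstructed from the description above, not quoted
      from the patch):

          static inline void tcp_segs_in(struct tcp_sock *tp,
                                         const struct sk_buff *skb)
          {
                  u16 segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);

                  tp->segs_in += segs;
                  /* skb->len check matters: a FIN may carry no data
                   * (see the v5 note above). */
                  if (skb->len > tcp_hdrlen(skb))
                          tp->data_segs_in += segs;
          }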
  8. 14 March 2016 (2 commits)
  9. 12 March 2016 (1 commit)
  10. 09 March 2016 (3 commits)
  11. 08 March 2016 (1 commit)
    • udp6: fix UDP/IPv6 encap resubmit path · 59dca1d8
      Committed by Bill Sommerfeld
      IPv4 interprets a negative return value from a protocol handler as a
      request to redispatch to a new protocol.  In contrast, IPv6 interprets a
      negative value as an error, and interprets a positive value as a request
      for redispatch.
      
      UDP for IPv6 was unaware of this difference. Change
      __udp6_lib_rcv() to return a positive value for redispatch. Note
      that the socket's encap_rcv hook still needs to return a negative
      value to request redispatch and, in the case of IPv6 packets, to
      adjust IP6CB(skb)->nhoff to identify the byte containing the next
      protocol.
      Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
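
      A hedged sketch of the corrected return handling in
      __udp6_lib_rcv(): unlike IPv4, the IPv6 input path wants the
      protocol number back as a positive value to trigger the redispatch.

          ret = udpv6_queue_rcv_skb(sk, skb);

          /* ret > 0 means "resubmit with this protocol"; IPv6 expects
           * it positive, whereas IPv4 would expect -ret here. */
          if (ret > 0)
                  return ret;
          return 0;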
  12. 04 March 2016 (3 commits)
    • net: ipv6: Fix refcnt on host routes · 799977d9
      Committed by David Ahern
      Andrew and Ying Huang's test robot both reported usage count problems that
      trace back to the 'keep address on ifdown' patch.
      
      From Andrew:
      We run CRIU tests on linux-next. On the current linux-next kernel
      they hang on creating a network namespace.
      
      The kernel log contains many messages like this:
      [ 1036.122108] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      [ 1046.165156] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      [ 1056.210287] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      
      I tried to revert this patch and the bug disappeared.
      
      Here is a set of commands to reproduce this bug:
      
      [root@linux-next-test linux-next]# uname -a
      Linux linux-next-test 4.5.0-rc6-next-20160301+ #3 SMP Wed Mar 2
      17:32:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
      
      [root@linux-next-test ~]# unshare -n
      [root@linux-next-test ~]# ip link set up dev lo
      [root@linux-next-test ~]# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
      group default qlen 1
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      [root@linux-next-test ~]# logout
      [root@linux-next-test ~]# unshare -n
      
       -----
      
      The problem is a change made to the RTM_DELADDR case in
      __ipv6_ifa_notify that was added in an early version of the
      offending patch and is no longer needed.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Cc: Andrey Wagin <avagin@gmail.com>
      Cc: Ying Huang <ying.huang@linux.intel.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: re-enable fragment header matching in ipv6_find_hdr · 5d150a98
      Committed by Florian Westphal
      When ipv6_find_hdr is used to find a fragment header (the caller
      specifies target NEXTHDR_FRAGMENT), we erroneously return -ENOENT
      for all fragments with a nonzero offset.
      
      Before commit 9195bb8e, when a target was specified, we did not
      enter the exthdr walk loop as nexthdr == target, so this used to
      work.
      
      Now we do (so we can skip empty routing headers). When we then
      stumble upon a frag with a nonzero frag_off, we must return -ENOENT
      ("header not found") only if the caller did not specifically
      request NEXTHDR_FRAGMENT.
      
      This allows the nftables exthdr expression to match ipv6 fragments,
      e.g. via
      nft add rule ip6 filter input frag frag-off gt 0
      
      Fixes: 9195bb8e ("ipv6: improve ipv6_find_hdr() to skip empty routing headers")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
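
      The essence of the fix, as a hedged sketch (condition simplified
      from the description above, not the literal diff):

          /* Inside a fragment with a nonzero offset the upper headers
           * are absent, so give up, unless the fragment header itself
           * is what the caller asked for. */
          if (frag_off != 0 && target != NEXTHDR_FRAGMENT)
                  return -ENOENT;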
    • mld, igmp: Fix reserved tailroom calculation · 1837b2e2
      Committed by Benjamin Poirier
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu, and the number of multicast address
      records, the current code may output skbs that have less tailroom
      than dev->needed_tailroom, or it may output more skbs than needed
      because not all of the available space is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
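
      The corrected expression, written out as a hedged C helper (to be
      called after skb_reserve(hlen); the helper name is assumed, with
      needed_tailroom standing in for tlen):

          static inline void skb_tailroom_reserve(struct sk_buff *skb,
                                                  unsigned int mtu,
                                                  unsigned int needed_tailroom)
          {
                  if (mtu < skb_tailroom(skb) - needed_tailroom)
                          /* use at most mtu */
                          skb->reserved_tailroom = skb_tailroom(skb) - mtu;
                  else
                          /* use up to all available space */
                          skb->reserved_tailroom = needed_tailroom;
          }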
  13. 03 March 2016 (3 commits)
    • netfilter: nft_masq: support port range · 8a6bf5da
      Committed by Pablo Neira Ayuso
      Complete masquerading support by allowing port range selection.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
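
      With this in place, a masquerade rule can pin source ports to a
      range, along the lines of (nft syntax from memory, treat as
      illustrative; a transport protocol match is required when a port
      range is given):

      nft add rule nat postrouting ip protocol tcp masquerade to :1024-65535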
    • netfilter: xtables: don't hook tables by default · b9e69e12
      Committed by Florian Westphal
      Delay hook registration until the table is first requested inside a
      namespace.
      
      Historically, a particular table (iptables mangle, ip6tables filter, etc)
      was registered on module load.
      
      When netns support was added to iptables, only the ip/ip6tables
      ruleset was made namespace-aware, not the actual hook points.
      
      This means e.g. that when the iptable_filter table/module is loaded
      on a system, each namespace on that system has an (empty) iptables
      filter ruleset.
      
      In other words, if a namespace sends a packet, such an skb is
      'caught' by the netfilter machinery and fed to the hook points for
      that table (i.e. INPUT, FORWARD, etc).
      
      Thanks to Eric Biederman, hooks are no longer global, but per namespace.
      
      This means that we can avoid allocating an empty ruleset in a
      namespace and defer hook registration until we need the
      functionality.
      
      We register a table's hook entry points ONLY in the initial
      namespace. When an iptables get/setsockopt is issued inside a given
      namespace, we check if the table is found in the per-namespace
      list.
      
      If not, we attempt to find it in the initial namespace, and, if found,
      create an empty default table in the requesting namespace and register the
      needed hooks.
      
      Hook points are destroyed only once the namespace is deleted; there
      is no 'usage count' (it makes no sense, since there is no 'remove
      table' operation in the xtables api).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: xtables: prepare for on-demand hook register · a67dd266
      Committed by Florian Westphal
      This change prepares for upcoming on-demand xtables hook registration.
      
      We change the prototypes of the register/unregister functions.
      A followup patch will then add nf_hook_register/unregister calls
      to the iptables one.
      
      Once a hook is registered, packets will be picked up, so all
      assignments of the form
      
      net->ipv4.iptable_$table = new_table
      
      have to be moved to ip(6)t_register_table, else we can see a NULL
      net->ipv4.iptable_$table later.
      
      This patch doesn't change functionality; without this, the actual
      change simply gets too big.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  14. 02 March 2016 (2 commits)
  15. 27 February 2016 (1 commit)
    • GSO: Provide software checksum of tunneled UDP fragmentation offload · 22463876
      Committed by Alexander Duyck
      On reviewing the code I realized that GRE and UDP tunnels could cause a
      kernel panic if we used GSO to segment a large UDP frame that was sent
      through the tunnel with an outer checksum and hardware offloads were not
      available.
      
      In order to correct this we need to update the feature flags that are
      passed to the skb_segment function so that in the event of UDP
      fragmentation being requested for the inner header the segmentation
      function will correctly generate the checksum for the payload if we cannot
      segment the outer header.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
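
      A hedged sketch of the idea (the placement and exact condition are
      assumed; the feature mask name is the real one): when the outer
      checksum cannot be offloaded, drop the checksum-offload bits from
      the features handed to the inner segmentation, forcing a software
      checksum of the payload.

          /* in the tunnel segmentation path, before skb_segment() */
          if (need_csum)
                  features &= ~NETIF_F_CSUM_MASK;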
  16. 26 February 2016 (1 commit)
    • net: ipv6: Make address flushing on ifdown optional · f1705ec1
      Committed by David Ahern
      Currently, all ipv6 addresses are flushed when the interface is configured
      down, including global, static addresses:
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          << nothing; all addresses have been flushed>>
      
      Add a new sysctl to make this behavior optional. The new setting
      defaults to flushing all addresses, to maintain backwards
      compatibility. When set, global addresses with no expiry times are
      not flushed on an admin down. The sysctl is per-interface or
      system-wide for all interfaces:
      
          $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
      or
          $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
      
      Either setting keeps the addresses on eth1 across an admin down:
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
              inet6 2100:1::2/120 scope global tentative
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
                 valid_lft forever preferred_lft forever
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 February 2016 (1 commit)
    • tunnel: Clear IPCB(skb)->opt before dst_link_failure called · 5146d1f1
      Committed by Bernie Harris
      IPCB may contain data from previous layers (in the observed case,
      the qdisc layer). In the observed scenario, the data was
      misinterpreted as IP header options, which later caused the ihl to
      be set to an invalid value (<5). This resulted in an infinite loop
      in the MIPS implementation of ip_fast_csum.
      
      This patch clears IPCB(skb)->opt before dst_link_failure can be called for
      various types of tunnels. This change only applies to encapsulated ipv4
      packets.
      
      The code introduced in 11c21a30, which clears all of IPCB, has been
      removed to be consistent with these changes; instead, the opt field
      is cleared unconditionally in ip_tunnel_xmit. The change in
      ip_tunnel_xmit applies to SIT, GRE, and IPIP tunnels.
      
      The relevant vti, l2tp, and pptp functions already contain similar code for
      clearing the IPCB.
      Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
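
      The unconditional clear amounts to one line in ip_tunnel_xmit()
      (sketch based on the description above):

          /* IPCB may still hold qdisc-layer data; never let it be
           * parsed as IP options further down the error path. */
          memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));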
  18. 22 February 2016 (1 commit)
  19. 20 February 2016 (3 commits)
  20. 19 February 2016 (2 commits)