1. 14 Feb, 2014 1 commit
    • net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal committed
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
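
The forwarding decision described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct fwd_pkt`, `fwd_seglen()` and `ipv4_forward_action()` are hypothetical names, not kernel symbols, and the per-segment length is approximated as header bytes plus `gso_size`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an skb on the forwarding path. */
struct fwd_pkt {
    unsigned int len;      /* total length of the (possibly aggregated) packet */
    unsigned int hdr_len;  /* network + transport header bytes per segment */
    unsigned int gso_size; /* payload bytes per segment (mss); used if is_gso */
    bool is_gso;           /* packet was aggregated by GRO / is a GSO packet */
    bool df;               /* IPv4 Don't Fragment bit */
};

enum fwd_action {
    FWD_PASS,             /* every segment fits: forward unchanged */
    FWD_SEND_FRAG_NEEDED, /* DF set: send ICMP fragmentation needed */
    FWD_SW_SEGMENT,       /* DF clear, GSO: segment in software, fragment on output */
    FWD_FRAGMENT          /* DF clear, not GSO: ordinary fragmentation on output */
};

/* Length one segment will have at the network layer once it is sent. */
unsigned int fwd_seglen(const struct fwd_pkt *p)
{
    assert(p != NULL);
    return p->is_gso ? p->hdr_len + p->gso_size : p->len;
}

/* Compare the individual segment length, not the aggregate length, with the MTU. */
enum fwd_action ipv4_forward_action(const struct fwd_pkt *p, unsigned int mtu)
{
    if (fwd_seglen(p) <= mtu)
        return FWD_PASS;
    if (p->df)
        return FWD_SEND_FRAG_NEEDED;
    return p->is_gso ? FWD_SW_SEGMENT : FWD_FRAGMENT;
}
```

In the Host/R1/R2 setup from the message, an aggregated packet with 1500-byte segments would pass a 1500-byte link but take one of the other two branches on the 1200-byte link, depending on the DF bit.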
      
      Eric Dumazet suggested simply shrinking the mss via ->gso_size to avoid
      software segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that if we shrank gso_size,
      skb_segment() would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses software segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  It's not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also it's not like this could not be improved later
      once the dust settles.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Feb, 2014 1 commit
  3. 06 Feb, 2014 2 commits
  4. 28 Jan, 2014 1 commit
    • net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Holger Eitzenberger committed
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      
      But there are many more, resulting in the machine going OOM after some
      days.
      
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
      
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
      
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
      
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
      
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      already has a socket assigned by the TPROXY target.  This then results
      in the leak I see.
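
The reference-count imbalance can be illustrated with a small userspace model. Everything here is a hypothetical stand-in (the `model_*` names are not kernel symbols), and the guarded variant is an assumption about one way to avoid the overwrite, not a quotation of the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal stand-ins for struct sock and struct sk_buff. */
struct model_sock { int refcnt; };
struct model_skb  { struct model_sock *sk; };

void model_sock_hold(struct model_sock *sk) { sk->refcnt++; }

void model_sock_put(struct model_sock *sk)
{
    assert(sk->refcnt > 0);
    sk->refcnt--;
}

/* What a TPROXY-like target does: attach a socket and take a reference. */
void model_tproxy_assign(struct model_skb *skb, struct model_sock *sk)
{
    model_sock_hold(sk);
    skb->sk = sk;
}

/* Unconditional assignment: the reference held via the old skb->sk is lost. */
void model_early_demux_unguarded(struct model_skb *skb, struct model_sock *found)
{
    model_sock_hold(found);
    skb->sk = found;
}

/* Guarded variant (assumption): leave an already attached socket alone. */
void model_early_demux_guarded(struct model_skb *skb, struct model_sock *found)
{
    if (skb->sk != NULL)
        return;
    model_sock_hold(found);
    skb->sk = found;
}

/* Freeing the skb releases only the socket it currently points at. */
void model_skb_free(struct model_skb *skb)
{
    if (skb->sk != NULL) {
        model_sock_put(skb->sk);
        skb->sk = NULL;
    }
}
```

With the unguarded path, the socket attached first keeps an extra reference after the skb is freed, which corresponds to the unreferenced `sk_prot_alloc` object in the report above.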
      
      The very same issue seems to exist with IPv6, but I haven't tested it.
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 25 Jan, 2014 1 commit
  6. 23 Jan, 2014 1 commit
  7. 22 Jan, 2014 3 commits
  8. 20 Jan, 2014 5 commits
  9. 19 Jan, 2014 1 commit
  10. 18 Jan, 2014 3 commits
  11. 16 Jan, 2014 3 commits
  12. 15 Jan, 2014 5 commits
  13. 14 Jan, 2014 1 commit
    • ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it · 0954cf9c
      Hannes Frederic Sowa committed
      In the IPv6 forwarding path we are only concerned about the outgoing
      interface MTU, but also respect locked MTUs on routes. Tunnel providers
      or IPsec already have to recheck and, if needed, send PtB notifications
      to the sending host in case the data does not fit into the packet with
      added headers (we only know the final header sizes there, while also
      using path MTU information).
      
      The reason for this change is that path MTU information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv6 forwarding path to
      wrongfully emit Packet-too-Big errors and drop IPv6 packets.
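
A minimal sketch of the distinction, assuming a simplified route record: `struct model_route` and both functions are hypothetical illustrations rather than the kernel's data structures. The forwarding variant deliberately ignores any learned path MTU.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IPV6_MIN_MTU 1280u

/* Hypothetical, simplified view of a route; not the kernel's dst_entry. */
struct model_route {
    unsigned int dev_mtu;     /* MTU of the outgoing interface; 0 if unknown */
    unsigned int learned_mtu; /* path MTU learned from ICMPv6 errors; 0 if none */
    unsigned int locked_mtu;  /* administratively locked route MTU; 0 if not locked */
};

/* MTU for locally generated traffic: honours learned path MTU information. */
unsigned int model_mtu_local(const struct model_route *rt)
{
    assert(rt != NULL);
    if (rt->locked_mtu)
        return rt->locked_mtu;
    if (rt->learned_mtu)
        return rt->learned_mtu;
    return rt->dev_mtu ? rt->dev_mtu : MODEL_IPV6_MIN_MTU;
}

/* MTU for forwarded traffic: locked MTU or interface MTU only, so an
 * injected path MTU value cannot make the router drop transit packets. */
unsigned int model_mtu_forward(const struct model_route *rt)
{
    assert(rt != NULL);
    if (rt->locked_mtu)
        return rt->locked_mtu;
    return rt->dev_mtu ? rt->dev_mtu : MODEL_IPV6_MIN_MTU;
}
```

For a 1500-byte interface with a spoofed 1280-byte path MTU, the local lookup would return 1280 while the forwarding lookup would still return 1500.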
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 10 Jan, 2014 6 commits
  15. 08 Jan, 2014 6 commits
    • netfilter: nf_tables: add "inet" table for IPv4/IPv6 · 1d49144c
      Patrick McHardy committed
      This patch adds a new table family and a new filter chain that you can
      use to attach IPv4 and IPv6 rules. This should help to simplify
      rule-set maintenance in dual-stack setups.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add support for multi family tables · 115a60b1
      Patrick McHardy committed
      Add support to register chains to multiple hooks for different address
      families for mixed IPv4/IPv6 tables.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: nf_tables: make chain types override the default AF functions · 3b088c4b
      Patrick McHardy committed
      Currently the AF-specific hook functions override the chain-type specific
      hook functions. That doesn't make too much sense since the chain types
      are a special case of the AF-specific hooks.
      
      Make the AF-specific hook functions the default and make the optional
      chain type hooks override them.
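
The resulting precedence can be sketched with plain function pointers. The types and names below (`model_afinfo`, `model_chain_type`, `model_select_hook` and the two sample hooks) are hypothetical and only illustrate the "default, then optional override" ordering; they are not the nf_tables structures.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NUM_HOOKS 5

typedef int (*model_hook_fn)(int pkt);

/* Per address family defaults: one hook function per hook number. */
struct model_afinfo { model_hook_fn hooks[MODEL_NUM_HOOKS]; };

/* Per chain type overrides: NULL means "no override for this hook". */
struct model_chain_type { model_hook_fn fn[MODEL_NUM_HOOKS]; };

/* Sample hooks used only to tell the two sources apart. */
int model_af_default_hook(int pkt)  { return pkt; }
int model_route_chain_hook(int pkt) { return pkt + 1; }

/* Start from the AF-specific default and let the chain type override it. */
model_hook_fn model_select_hook(const struct model_afinfo *afi,
                                const struct model_chain_type *type,
                                unsigned int hooknum)
{
    assert(afi != NULL && hooknum < MODEL_NUM_HOOKS);
    if (type != NULL && type->fn[hooknum] != NULL)
        return type->fn[hooknum];
    return afi->hooks[hooknum];
}
```

A chain type that sets only one entry therefore changes behaviour for that hook alone and inherits the family default everywhere else.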
      
      As a side effect, the necessary code restructuring reduces the code size,
      e.g. in the case of nf_tables_ipv4.o:
      
        nf_tables_ipv4_init_net   |  -24
        nft_do_chain_ipv4         | -113
       2 functions changed, 137 bytes removed, diff: -137
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net-gre-gro: Add GRE support to the GRO stack · bf5a755f
      Jerry Chu committed
      This patch builds on top of commit 299603e8
      ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
      support for standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
      stack. It also serves as an example for supporting other encapsulation
      protocols in the GRO stack in the future.
      
      The patch supports version 0 and all the flags (key, csum, seq#) but
      will flush any pkt with the S (seq#) flag. This is because the S flag
      is not supported by GSO, and a GRO pkt may end up in the forwarding path,
      thus requiring GSO support to break it up correctly.
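
As a rough illustration of that rule, the sketch below decides from the GRE flags word whether a packet could be aggregated. The flag bit values follow the GRE RFCs cited above (in host byte order); the `MODEL_*` macros and both functions are hypothetical names, and only the C, K and S options are considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GRE flag bits in host byte order (RFC 1701 / RFC 2784 / RFC 2890). */
#define MODEL_GRE_CSUM    0x8000u /* C: checksum present */
#define MODEL_GRE_KEY     0x2000u /* K: key present */
#define MODEL_GRE_SEQ     0x1000u /* S: sequence number present */
#define MODEL_GRE_VERSION 0x0007u /* version field */

/* Header length implied by the flags: 4 base bytes plus 4 per optional field. */
unsigned int model_gre_hdr_len(uint16_t flags)
{
    unsigned int len = 4;

    if (flags & MODEL_GRE_CSUM)
        len += 4;
    if (flags & MODEL_GRE_KEY)
        len += 4;
    if (flags & MODEL_GRE_SEQ)
        len += 4;
    assert(len >= 4 && len <= 16);
    return len;
}

/* Aggregate only version 0 packets without a sequence number; anything
 * else is flushed, since GSO could not re-segment it when forwarding. */
bool model_gre_gro_mergeable(uint16_t flags)
{
    if (flags & MODEL_GRE_VERSION)
        return false;
    if (flags & MODEL_GRE_SEQ)
        return false;
    return true;
}
```

A tunnel configured with only a key would be mergeable under this model, while the same tunnel with sequence numbers enabled would always be flushed.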
      
      Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
      ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
      IP pkts (i.e., w/o L2 hdr). But support for other protocol types can
      easily be added, as can support for GRE variations like NVGRE.
      
      The patch also supports csum offload. Specifically if the csum flag is on
      and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
      the code will take advantage of the csum computed by the h/w when
      validating the GRE csum.
      
      Note that commit 60769a5d "ipv4: gre:
      add GRO capability" already introduces GRO capability to IPv4 GRE
      tunnels, using the gro_cells infrastructure. But GRO is done after
      GRE hdr has been removed (i.e., decapped). The following patch applies
      GRO when pkts first come in (before hitting the GRE tunnel code). There
      is some performance advantage for applying GRO as early as possible.
      Also this approach is transparent to other subsystems like Open vSwitch
      where GRE decap is handled outside of the IP stack hence making it
      harder for the gro_cells stuff to apply. On the other hand, some NICs
      are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
      In that case the GRO processing of pkts from the same remote host will
      all happen on the same CPU and the performance may be suboptimal.
      
      I'm including some rough preliminary performance numbers below. Note
      that the performance will, as usual, be highly dependent on traffic load
      and mix. Moreover it also depends on NIC offload features, hence the
      following is by no means a comprehensive study. Local testing and tuning
      will be needed to decide the best setting.
      
      All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
      (super_netperf 50 -H 192.168.1.18 -l 30)
      
      An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
      mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
      is configured.
      
      The GRO support for pkts AFTER decap is controlled through the device
      feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
      
      1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
      thruput: 9.16Gbps
      CPU utilization: 19%
      
      1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
      thruput: 5.9Gbps
      CPU utilization: 15%
      
      1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 12-13%
      
      1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 10%
      
      The following tests were performed on a different NIC that is capable of
      csum offload. I.e., the h/w is capable of computing IP payload csum
      (CHECKSUM_COMPLETE).
      
      2.1 ethtool -K gre1 gro on (hence will use gro_cells)
      
      2.1.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 8.53Gbps
      CPU utilization: 9%
      
      2.1.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 8.97Gbps
      CPU utilization: 7-8%
      
      2.1.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 8.83Gbps
      CPU utilization: 5-6%
      
      2.1.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.98Gbps
      CPU utilization: 5%
      
      2.2 ethtool -K gre1 gro off
      
      2.2.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 5.93Gbps
      CPU utilization: 9%
      
      2.2.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 5.62Gbps
      CPU utilization: 8%
      
      2.2.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 7.69Gbps
      CPU utilization: 8%
      
      2.2.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.96Gbps
      CPU utilization: 5-6%
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • IPv6: add the option to use anycast addresses as source addresses in echo reply · 509aba3b
      FX Le Bail committed
      This change makes it possible to follow a recommendation of RFC4942.
      
      - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
        as source addresses for ICMPv6 echo reply. This sysctl is false by default
        to preserve existing behavior.
      - Add inline check ipv6_anycast_destination().
      - Use them in icmpv6_echo_reply().
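
The source-address choice listed above can be sketched as follows. This is a simplified model under assumptions: `enum model_dst_kind` and `model_echo_reply_saddr()` are hypothetical, and a NULL return stands for "let normal source address selection pick".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the destination address of the incoming echo request is classified. */
enum model_dst_kind {
    MODEL_DST_UNICAST,
    MODEL_DST_ANYCAST,
    MODEL_DST_MULTICAST
};

/* Returns the address to use as the reply's source, or NULL to fall back
 * to normal source address selection. */
const char *model_echo_reply_saddr(const char *request_daddr,
                                   enum model_dst_kind kind,
                                   bool anycast_src_echo_reply)
{
    assert(request_daddr != NULL);
    if (kind == MODEL_DST_UNICAST)
        return request_daddr;
    if (kind == MODEL_DST_ANYCAST && anycast_src_echo_reply)
        return request_daddr;
    return NULL;
}
```

With the sysctl left at its default of false, the anycast case falls through to normal selection, which matches the stated goal of preserving existing behavior.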
      
      Reference:
      RFC4942 - IPv6 Transition/Coexistence Security Considerations
         (http://tools.ietf.org/html/rfc4942#section-2.1.6)
      
      2.1.6. Anycast Traffic Identification and Security
      
         [...]
         To avoid exposing knowledge about the internal structure of the
         network, it is recommended that anycast servers now take advantage of
         the ability to return responses with the anycast address as the
         source address if possible.
      Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: pcpu_tstats.syncp should be initialised in ip6_vti.c · 657e5d19
      Li RongQing committed
      Initialise pcpu_tstats.syncp to get rid of the following call trace:
      [   11.973950] Call Trace:
      [   11.973950]  [<819bbaff>] dump_stack+0x48/0x60
      [   11.973950]  [<81078dcf>] __lock_acquire.isra.22+0x1bf/0xc10
      [   11.973950]  [<81079fa7>] lock_acquire+0x77/0xa0
      [   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
      [   11.973950]  [<8183862d>] ip_tunnel_get_stats64+0x6d/0x230
      [   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
      [   11.973950]  [<811cf8c1>] ? __nla_reserve+0x21/0xd0
      [   11.973950]  [<817ca7ab>] dev_get_stats+0xcb/0x130
      [   11.973950]  [<817d5409>] rtnl_fill_ifinfo+0x569/0xe20
      [   11.973950]  [<810352e0>] ? kvm_clock_read+0x20/0x30
      [   11.973950]  [<81008e38>] ? sched_clock+0x8/0x10
      [   11.973950]  [<8106ba45>] ? sched_clock_local+0x25/0x170
      [   11.973950]  [<810da6bd>] ? __kmalloc+0x3d/0x90
      [   11.973950]  [<817b8c10>] ? __kmalloc_reserve.isra.41+0x20/0x70
      [   11.973950]  [<810da81a>] ? slob_alloc_node+0x2a/0x60
      [   11.973950]  [<817b919a>] ? __alloc_skb+0x6a/0x2b0
      [   11.973950]  [<817d8795>] rtmsg_ifinfo+0x65/0xe0
      [   11.973950]  [<817cbd31>] register_netdevice+0x531/0x5a0
      [   11.973950]  [<81892b87>] ? ip6_tnl_get_cap+0x27/0x90
      [   11.973950]  [<817cbdb6>] register_netdev+0x16/0x30
      [   11.973950]  [<81f574a6>] vti6_init_net+0x1c4/0x1d4
      [   11.973950]  [<81f573af>] ? vti6_init_net+0xcd/0x1d4
      [   11.973950]  [<817c16df>] ops_init.constprop.11+0x17f/0x1c0
      [   11.973950]  [<817c1779>] register_pernet_operations.isra.9+0x59/0x90
      [   11.973950]  [<817c18d1>] register_pernet_device+0x21/0x60
      [   11.973950]  [<81f574b6>] ? vti6_init_net+0x1d4/0x1d4
      [   11.973950]  [<81f574c7>] vti6_tunnel_init+0x11/0x68
      [   11.973950]  [<81f572a1>] ? mip6_init+0x73/0xb4
      [   11.973950]  [<81f0cba4>] do_one_initcall+0xbb/0x15b
      [   11.973950]  [<811a00d8>] ? sha_transform+0x528/0x1150
      [   11.973950]  [<81f0c544>] ? repair_env_string+0x12/0x51
      [   11.973950]  [<8105c30d>] ? parse_args+0x2ad/0x440
      [   11.973950]  [<810546be>] ? __usermodehelper_set_disable_depth+0x3e/0x50
      [   11.973950]  [<81f0cd27>] kernel_init_freeable+0xe3/0x182
      [   11.973950]  [<81f0c532>] ? do_early_param+0x7a/0x7a
      [   11.973950]  [<819b5b1b>] kernel_init+0xb/0x100
      [   11.973950]  [<819cebf7>] ret_from_kernel_thread+0x1b/0x28
      [   11.973950]  [<819b5b10>] ? rest_init+0xc0/0xc0
      
      Before 469bdcef ("ipv6: fix the use of pcpu_tstats in ip6_vti.c"),
      pcpu_tstats.syncp was not used to protect the 64-bit elements of
      pcpu_tstats, so this call trace did not appear.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>