1. 08 Nov, 2013 1 commit
    • E
      inet: fix a UFO regression · dcd60771
      Eric Dumazet committed
      While testing virtio_net and skb_segment() changes, Hannes reported
      that UFO was sending wrong frames.
      
      It appears this was introduced by a recent commit :
      8c3a897b ("inet: restore gso for vxlan")
      
      The old condition to perform IP frag was :
      
      tunnel = !!skb->encapsulation;
      ...
              if (!tunnel && proto == IPPROTO_UDP) {
      
      So the new one should be :
      
      udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
      ...
              if (udpfrag) {
      
      Initialization of udpfrag must be done before the call
      to ops->callbacks.gso_segment(skb, features), as
      skb_udp_tunnel_segment() clears skb->encapsulation.
      
      (We want udpfrag to be true for UFO, false for VXLAN)
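A minimal sketch of why the ordering matters, assuming a stripped-down packet descriptor rather than the kernel's sk_buff. The names `struct pkt`, `segment_callback`, `needs_udp_frag`, `needs_udp_frag_late` and `SKETCH_IPPROTO_UDP` are hypothetical; only the "latch the flag before the callback clears it" idea comes from the message above.

```c
#include <stdbool.h>

#define SKETCH_IPPROTO_UDP 17  /* hypothetical stand-in for IPPROTO_UDP */

/* Hypothetical, stripped-down packet descriptor (not the kernel's sk_buff). */
struct pkt {
    bool encapsulation;  /* set for tunnelled traffic such as VXLAN */
};

/* Stand-in for the segmentation callback: like skb_udp_tunnel_segment(),
 * it clears the encapsulation flag as a side effect. */
static void segment_callback(struct pkt *p)
{
    p->encapsulation = false;
}

/* Correct ordering: latch udpfrag BEFORE the callback mutates the packet. */
static bool needs_udp_frag(struct pkt *p, int proto)
{
    bool udpfrag = !p->encapsulation && proto == SKETCH_IPPROTO_UDP;

    segment_callback(p);
    return udpfrag;
}

/* Late evaluation: the flag is read after the callback cleared it, so a
 * tunnelled packet is indistinguishable from a plain UDP one. */
static bool needs_udp_frag_late(struct pkt *p, int proto)
{
    segment_callback(p);
    return !p->encapsulation && proto == SKETCH_IPPROTO_UDP;
}
```

With a tunnelled packet, `needs_udp_frag()` reports false while `needs_udp_frag_late()` reports true, which is the kind of confusion the early initialization avoids.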
      
      With help from Alexei Starovoitov
      Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcd60771
  2. 06 Nov, 2013 1 commit
    • H
      ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609
      Hannes Frederic Sowa committed
      Sockets marked with IP_PMTUDISC_INTERFACE won't do path MTU discovery:
      they won't accept or install new path MTU information and they will
      always use the interface MTU for outgoing packets. It is guaranteed
      that the packet is not fragmented locally, but the DF flag is not set
      on the outgoing frames.
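As a rough userspace sketch, the mode would be selected through the existing IP_MTU_DISCOVER socket option. The helper name `udp_socket_pmtudisc_interface` is hypothetical, and the fallback definition of IP_PMTUDISC_INTERFACE (value 4, as in the Linux UAPI header) is only there for libc headers that predate the option.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback for libc headers that predate the option; 4 is the value used
 * by the Linux UAPI header <linux/in.h>. */
#ifndef IP_PMTUDISC_INTERFACE
#define IP_PMTUDISC_INTERFACE 4
#endif

/* Hypothetical helper: create a UDP socket that ignores path MTU
 * information and always uses the interface MTU.
 * Returns the file descriptor, or -1 on error (for example when an older
 * kernel rejects the mode). */
static int udp_socket_pmtudisc_interface(void)
{
    int mode = IP_PMTUDISC_INTERFACE;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A DNS server would apply this to its listening UDP socket; datagrams larger than the interface MTU are then rejected locally instead of being fragmented.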
      
      Florian Weimer had the idea to use this flag to ensure DNS servers
      never generate outgoing fragments. They may well be fragmented on the
      path, but the server never stores or uses path MTU values, which could
      well be forged in an attack.
      
      (The root of the problem with path MTU discovery is that there is
      no reliable way to authenticate ICMP Fragmentation Needed But DF Set
      messages because they are sent from intermediate routers with their
      source addresses, and the ICMP payload will not always contain sufficient
      information to identify a flow.)
      
      Recent research in the DNS community showed that it is possible to
      implement an attack where DNS cache poisoning is feasible by spoofing
      fragments. This work was done by Amir Herzberg and Haya Shulman:
      <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
      
      This issue was previously discussed among the DNS community, e.g.
      <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
      without leading to fixes.
      
      This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
      regarding local fragmentation with UFO/CORK" for the enforcement of the
      non-fragmentable checks. If other users than ip_append_page/data should
      use this semantic too, we have to add a new flag to IPCB(skb)->flags to
      suppress local fragmentation and check for this in ip_finish_output.
      
      Many thanks to Florian Weimer for the idea and feedback while implementing
      this patch.
      
      Cc: David S. Miller <davem@davemloft.net>
      Suggested-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      482fc609
  3. 05 Nov, 2013 2 commits
    • Y
      tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng committed
      Slow start currently increases cwnd by 1 if an ACK acknowledges some
      packets, regardless of the number of packets. Consequently slow start
      performance is highly dependent on the degree of the stretch ACKs caused
      by receiver or network ACK compression mechanisms (e.g., delayed-ACK,
      GRO, etc). But the slow start algorithm is meant to send twice the
      number of packets that have left the network, so it should process a
      stretch ACK of degree N as if it were N ACKs of degree 1, then exit when
      cwnd exceeds ssthresh. A follow-up patch will use the remainder of the N
      (if greater than 1) to adjust cwnd in the congestion avoidance phase.
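The stretch-ACK handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the kernel function: `struct cc_state` and `slow_start` are hypothetical names, cwnd clamping is left out, and the caller is assumed to be in slow start (cwnd <= ssthresh).

```c
#include <stdint.h>

/* Hypothetical congestion state, counted in packets. */
struct cc_state {
    uint32_t cwnd;      /* congestion window */
    uint32_t ssthresh;  /* slow start threshold */
};

/* Process one ACK that acknowledges `acked` packets while in slow start.
 * Precondition (assumed): s->cwnd <= s->ssthresh.
 * cwnd grows by one per acked packet, as if `acked` separate ACKs had
 * arrived, but stops once it exceeds ssthresh. Returns the number of acked
 * packets left over for the congestion avoidance phase. */
static uint32_t slow_start(struct cc_state *s, uint32_t acked)
{
    uint32_t cwnd = s->cwnd + acked;

    if (cwnd > s->ssthresh)
        cwnd = s->ssthresh + 1;
    acked -= cwnd - s->cwnd;  /* packets actually consumed by slow start */
    s->cwnd = cwnd;
    return acked;
}
```

For example, with cwnd 10 and ssthresh 12, an ACK covering 5 packets raises cwnd to 13 and leaves 2 packets for congestion avoidance.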
      
      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPs seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f9843a7
    • Y
      tcp: enable sockets to use MSG_FASTOPEN by default · 0d41cca4
      Yuchung Cheng committed
      Applications have started to use Fast Open (e.g., Chrome browser has
      such an optional flag) and the feature has gone through several
      generations of kernels since 3.7 with many real network tests. It's
      time to enable this flag by default for applications to test more
      conveniently and extensively.
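For context, a client opts in per call by passing MSG_FASTOPEN to sendto() on a TCP socket that has not been connected yet. The wrapper below is a sketch: `tfo_connect_send` is a hypothetical name, and the fallback flag value is the one used by the Linux socket headers. Whether data really travels in the SYN still depends on the kernel holding a cookie for the destination and on the net.ipv4.tcp_fastopen setting.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallback for libc headers that lack the flag (value from the Linux
 * socket headers). */
#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000
#endif

/* Hypothetical wrapper: on an unconnected TCP socket, sendto() with
 * MSG_FASTOPEN takes the place of the usual connect() + send() pair.
 * Returns the number of bytes queued, or -1 with errno set. */
static ssize_t tfo_connect_send(int fd, const struct sockaddr_in *dst,
                                const void *buf, size_t len)
{
    return sendto(fd, buf, len, MSG_FASTOPEN,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```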
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d41cca4
  4. 04 Nov, 2013 1 commit
  5. 01 Nov, 2013 1 commit
  6. 30 Oct, 2013 1 commit
    • Y
      tcp: temporarily disable Fast Open on SYN timeout · c968601d
      Yuchung Cheng committed
      Fast Open currently has a fallback feature to address SYN-data being
      dropped, but it requires the middle-box to pass on the regular SYN retry
      after SYN-data. This is implemented in commit aab48743 ("net-tcp:
      Fast Open client - detecting SYN-data drops").
      
      However, some NAT boxes will drop all subsequent packets after the first
      SYN-data and blackhole the entire connection. An example is in
      commit 356d7d88 "netfilter: nf_conntrack: fix tcp_in_window for Fast
      Open".
      
      The sender should note such incidents and temporarily fall back to the
      regular TCP handshake on subsequent attempts as well: after the
      second SYN times out, the original Fast Open SYN is most likely lost.
      When such an event recurs, Fast Open is disabled for a period that grows
      exponentially with the number of recurrences.
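A small sketch of such an exponentially growing disable window, under stated assumptions: the structure, the function name, the 60-second base period and the shift cap are all illustrative and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TFO_BASE_BACKOFF_SEC 60u  /* assumed base period, illustrative */
#define TFO_MAX_SHIFT        10u  /* assumed cap to keep the shift bounded */

/* Hypothetical per-destination record of SYN-timeout incidents. */
struct tfo_loss_state {
    unsigned int syn_loss;   /* how many times the incident has recurred */
    uint64_t last_loss_sec;  /* time of the most recent incident, seconds */
};

/* Returns true while Fast Open should stay disabled for this destination.
 * The window doubles with every recorded recurrence. */
static bool tfo_temporarily_disabled(const struct tfo_loss_state *st,
                                     uint64_t now_sec)
{
    unsigned int shift;

    if (st->syn_loss == 0)
        return false;
    shift = st->syn_loss < TFO_MAX_SHIFT ? st->syn_loss : TFO_MAX_SHIFT;
    return now_sec < st->last_loss_sec +
                     ((uint64_t)TFO_BASE_BACKOFF_SEC << shift);
}
```

Under these assumed constants, one incident disables Fast Open for 120 seconds and three incidents for 480 seconds.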
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c968601d
  7. 29 Oct, 2013 4 commits
    • M
      net: esp{4,6}: get rid of struct esp_data · 1c5ad13f
      Mathias Krause committed
      struct esp_data consists of a single pointer, eliminating the need for it
      to be a structure. Fold the pointer into 'data' directly, removing one
      level of pointer indirection.
      Signed-off-by: Mathias Krause <mathias.krause@secunet.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      1c5ad13f
    • M
      net: esp{4,6}: remove padlen from struct esp_data · 123b0d1b
      Mathias Krause committed
      The padlen member of struct esp_data is always zero. Get rid of it.
      Signed-off-by: Mathias Krause <mathias.krause@secunet.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      123b0d1b
    • H
      ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK · daba287b
      Hannes Frederic Sowa committed
      UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and
      IP_PMTUDISC_PROBE well enough.
      
      UFO enabled packet delivery just appends all frags to the cork and hands
      it over to the network card. So we just deliver non-DF udp fragments
      (DF-flag may get overwritten by hardware or virtual UFO enabled
      interface).
      
      UDP_CORK does enqueue the data until the cork is disengaged. At this
      point it sets the correct IP_DF and local_df flags and hands it over to
      ip_fragment which in this case will generate an icmp error which gets
      appended to the error socket queue. This is not reflected in the syscall
      error (of course, if UFO is enabled this also won't happen).
      
      Improve this by checking the pmtudisc flags before appending data to the
      socket: when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, proceed only if
      all data still fits in one packet.
      
      We use (mtu-fragheaderlen) to check for the maximum length because we
      ensure not to generate a fragment and non-fragmented data does not need
      to have its length aligned on 64 bit boundaries. Also the passed in
      ip_options are already aligned correctly.
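The length check described above might look roughly like this. It is a sketch with hypothetical names (`check_append` and its parameters); the 0xFFFF fallback, the maximum IPv4 datagram length, is an assumption about what applies when local fragmentation is allowed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pre-append check.
 * no_local_frag: true when the socket uses IP_PMTUDISC_DO or
 *                IP_PMTUDISC_PROBE, so the data must fit in one packet.
 * mtu:           MTU in bytes.
 * fragheaderlen: IP header length including options.
 * queued:        payload bytes already sitting in the cork.
 * len:           payload bytes about to be appended.
 * Returns 0 if the data may be appended, -EMSGSIZE otherwise. */
static int check_append(bool no_local_frag, size_t mtu, size_t fragheaderlen,
                        size_t queued, size_t len)
{
    /* No 8-byte alignment is applied: unfragmented data does not need it. */
    size_t limit = (no_local_frag ? mtu : 0xFFFF) - fragheaderlen;

    if (queued + len > limit)
        return -EMSGSIZE;
    return 0;
}
```

With an MTU of 1500 and a 20-byte header, 1480 bytes of payload pass the check and the 1481st byte triggers -EMSGSIZE at append time rather than later in ip_fragment.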
      
      Maybe, we can relax some other checks around ip_fragment. This needs
      more research.
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      daba287b
    • E
      tcp: gso: fix truesize tracking · 0d08c42c
      Eric Dumazet committed
      Commit 6ff50cd5 ("tcp: gso: do not generate out of order packets")
      had a heuristic that can trigger a warning in skb_try_coalesce(),
      because skb->truesize of the gso segments was set to exactly mss.
      
      This breaks the requirement that
      
      skb->truesize >= skb->len + truesizeof(struct sk_buff);
      
      It can trivially be reproduced by :
      
      ifconfig lo mtu 1500
      ethtool -K lo tso off
      netperf
      
      As the skbs are looped into the TCP networking stack, skb_try_coalesce()
      warns us of these skb under-estimating their truesize.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d08c42c
  8. 28 Oct, 2013 5 commits
  9. 24 Oct, 2013 2 commits
  10. 22 Oct, 2013 9 commits
    • W
      netfilter: x_tables: fix ordering of jumpstack allocation and table update · b416c144
      Will Deacon committed
      During kernel stability testing on an SMP ARMv7 system, Yalin Wang
      reported the following panic from the netfilter code:
      
        1fe0: 0000001c 5e2d3b10 4007e779 4009e110 60000010 00000032 ff565656 ff545454
        [<c06c48dc>] (ipt_do_table+0x448/0x584) from [<c0655ef0>] (nf_iterate+0x48/0x7c)
        [<c0655ef0>] (nf_iterate+0x48/0x7c) from [<c0655f7c>] (nf_hook_slow+0x58/0x104)
        [<c0655f7c>] (nf_hook_slow+0x58/0x104) from [<c0683bbc>] (ip_local_deliver+0x88/0xa8)
        [<c0683bbc>] (ip_local_deliver+0x88/0xa8) from [<c0683718>] (ip_rcv_finish+0x418/0x43c)
        [<c0683718>] (ip_rcv_finish+0x418/0x43c) from [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598)
        [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) from [<c062b314>] (process_backlog+0x84/0x158)
        [<c062b314>] (process_backlog+0x84/0x158) from [<c062de84>] (net_rx_action+0x70/0x1dc)
        [<c062de84>] (net_rx_action+0x70/0x1dc) from [<c0088230>] (__do_softirq+0x11c/0x27c)
        [<c0088230>] (__do_softirq+0x11c/0x27c) from [<c008857c>] (do_softirq+0x44/0x50)
        [<c008857c>] (do_softirq+0x44/0x50) from [<c0088614>] (local_bh_enable_ip+0x8c/0xd0)
        [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) from [<c06b0330>] (inet_stream_connect+0x164/0x298)
        [<c06b0330>] (inet_stream_connect+0x164/0x298) from [<c061d68c>] (sys_connect+0x88/0xc8)
        [<c061d68c>] (sys_connect+0x88/0xc8) from [<c000e340>] (ret_fast_syscall+0x0/0x30)
        Code: 2a000021 e59d2028 e59de01c e59f011c (e7824103)
        ---[ end trace da227214a82491bd ]---
        Kernel panic - not syncing: Fatal exception in interrupt
      
      This comes about because CPU1 is executing xt_replace_table in response
      to a setsockopt syscall, resulting in:
      
      	ret = xt_jumpstack_alloc(newinfo);
      		--> newinfo->jumpstack = kzalloc(size, GFP_KERNEL);
      
      	[...]
      
      	table->private = newinfo;
      	newinfo->initial_entries = private->initial_entries;
      
      Meanwhile, CPU0 is handling the network receive path and ends up in
      ipt_do_table, resulting in:
      
      	private = table->private;
      
      	[...]
      
      	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
      
      On weakly ordered memory architectures, the writes to table->private
      and newinfo->jumpstack from CPU1 can be observed out of order by CPU0.
      Furthermore, on architectures which don't respect ordering of address
      dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered.
      
      This patch adds an smp_wmb() before the assignment to table->private
      (which is essentially publishing newinfo) to ensure that all writes to
      newinfo will be observed before plugging it into the table structure.
      A dependent-read barrier is also added on the consumer sides, to ensure
      the same ordering requirements are also respected there.
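The publish/consume ordering can be illustrated in portable C11 atomics. This is a userspace analogy, not the kernel patch: the release store stands in for smp_wmb() followed by the pointer assignment, the acquire load stands in for the dependent-read barrier on the consumer side, and every name below is hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical counterparts of the newinfo structure and the table. */
struct table_info {
    unsigned int initial_entries;
    void **jumpstack;
};

struct table {
    _Atomic(struct table_info *) private_info;
};

/* Writer side: fill in every field first, then publish the pointer.
 * The release store guarantees the field writes are visible to any
 * reader that observes the new pointer. */
static void publish_info(struct table *t, struct table_info *newinfo,
                         void **jumpstack, unsigned int initial_entries)
{
    newinfo->jumpstack = jumpstack;
    newinfo->initial_entries = initial_entries;
    atomic_store_explicit(&t->private_info, newinfo, memory_order_release);
}

/* Reader side: load the pointer with acquire semantics before
 * dereferencing it, so the dependent read cannot see stale fields. */
static void **read_jumpstack(struct table *t)
{
    struct table_info *info =
        atomic_load_explicit(&t->private_info, memory_order_acquire);

    return info ? info->jumpstack : NULL;
}
```

Acquire is stronger than the dependency ordering the kernel relies on; it is used here only because it is the portable way to express the same guarantee.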
      
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
      Tested-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b416c144
    • N
      tcp: initialize passive-side sk_pacing_rate after 3WHS · 02cf4ebd
      Neal Cardwell committed
      For passive TCP connections, upon receiving the ACK that completes the
      3WHS, make sure we set our pacing rate after we get our first RTT
      sample.
      
      On passive TCP connections, when we receive the ACK completing the
      3WHS we do not take an RTT sample in tcp_ack(), but rather in
      tcp_synack_rtt_meas(). So upon receiving the ACK that completes the
      3WHS, tcp_ack() leaves sk_pacing_rate at its initial value.
      
      Originally the initial sk_pacing_rate value was 0, so passive-side
      connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs
      made in the first RTT. With a default initial cwnd of 10 packets, this
      happened to be correct for RTTs 5ms or bigger, so it was hard to
      see problems in WAN or emulated WAN testing.
      
      Since 7eec4174 ("pkt_sched: fq: fix non TCP flows pacing"), the
      initial sk_pacing_rate is 0xffffffff. So after that change, passive
      TCP connections were keeping this value (and using large numbers of
      segments per skbuff) until receiving an ACK for data.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02cf4ebd
    • E
      ipv6: sit: add GSO/TSO support · 61c1db7f
      Eric Dumazet committed
      Now that ipv6_gso_segment() is stackable, it's relatively easy to
      implement GSO/TSO support for SIT tunnels.
      
      Performance results, when segmentation is done after tunnel
      device (as no NIC is yet enabled for TSO SIT support) :
      
      Before patch :
      
      lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
      MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      3168.31   4.81     4.64     2.988   2.877
      
      After patch :
      
      lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
      MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      5525.00   7.76     5.17     2.763   1.840
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61c1db7f
    • E
      ipv4: Allow unprivileged users to use per net sysctls · fd2d5356
      Eric W. Biederman committed
      Allow unprivileged users to use:
      /proc/sys/net/ipv4/icmp_echo_ignore_all
      /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
      /proc/sys/net/ipv4/icmp_ignore_bogus_error_response
      /proc/sys/net/ipv4/icmp_errors_use_inbound_ifaddr
      /proc/sys/net/ipv4/icmp_ratelimit
      /proc/sys/net/ipv4/icmp_ratemask
      /proc/sys/net/ipv4/ping_group_range
      /proc/sys/net/ipv4/tcp_ecn
      /proc/sys/net/ipv4/ip_local_port_range
      
      These are occasionally handy, and after a quick review I don't see
      any problems with unprivileged users using them.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd2d5356
    • E
      ipv4: Use math to point per net sysctls into the appropriate struct net. · 0a6fa23d
      Eric W. Biederman committed
      Simplify maintenance of ipv4_net_table by using math to point the per
      net sysctls into the appropriate struct net, instead of manually
      reassigning all of the variables into hard coded table slots.
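The "math" here is plain offset arithmetic: each table entry points at a field of the initial namespace, and the copy for another namespace shifts that pointer by the same offset into the new structure. The sketch below uses hypothetical types (`struct netns`, `struct ctl_entry`) and a hypothetical helper name; it illustrates the technique and is not the kernel code.

```c
#include <stddef.h>

/* Hypothetical per-namespace settings. */
struct netns {
    int icmp_echo_ignore_all;
    int tcp_ecn;
};

/* Hypothetical sysctl table entry. */
struct ctl_entry {
    const char *name;
    void *data;
};

/* The template table points into the initial namespace. */
static struct netns init_ns;

static const struct ctl_entry template_table[] = {
    { "icmp_echo_ignore_all", &init_ns.icmp_echo_ignore_all },
    { "tcp_ecn",              &init_ns.tcp_ecn },
};

#define TABLE_LEN (sizeof(template_table) / sizeof(template_table[0]))

/* Copy the template and rebase every data pointer onto `ns` by reusing
 * the field's byte offset within init_ns. No per-slot assignments are
 * needed, so adding an entry to the template is enough. */
static void rebase_table(struct ctl_entry *out, struct netns *ns)
{
    for (size_t i = 0; i < TABLE_LEN; i++) {
        ptrdiff_t off = (char *)template_table[i].data - (char *)&init_ns;

        out[i].name = template_table[i].name;
        out[i].data = (char *)ns + off;
    }
}
```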
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a6fa23d
    • E
      tcp_memcontrol: Kill struct tcp_memcontrol · 2e685cad
      Eric W. Biederman committed
      Replace the pointers in struct cg_proto with actual data fields and kill
      struct tcp_memcontrol as it is now fully redundant.
      
      This removes a confusing, unnecessary layer of abstraction.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e685cad
    • E
      tcp_memcontrol: Remove the per netns control. · a4fe34bf
      Eric W. Biederman committed
      The code that is implemented is per memory cgroup not per netns, and
      having per netns bits is just confusing.  Remove the per netns bits to
      make it easier to see what is really going on.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4fe34bf
    • E
      tcp_memcontrol: Remove setting cgroup settings via sysctl · f594d631
      Eric W. Biederman committed
      The code is broken and does not constrain sysctl_tcp_mem as
      tcp_update_limit does. As a result, it allows the cgroup tcp
      memory limits to be bypassed.
      
      The semantics are broken as well: the settings are not per netns yet
      live in a per netns table, and the code instead looks at current.
      
      Since the code is broken in both design and implementation and does not
      implement the functionality for which it was written, remove it.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f594d631
    • E
      tcp_memcontrol: Remove tcp_max_memory · cd91cce6
      Eric W. Biederman committed
      This function is never called. Remove it.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd91cce6
  11. 20 Oct, 2013 10 commits
  12. 19 Oct, 2013 3 commits