1. 29 9月, 2014 1 次提交
  2. 24 9月, 2014 1 次提交
    • E
      icmp: add a global rate limitation · 4cdf507d
      Eric Dumazet 提交于
      Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
      protected by a lock, meaning that hosts can be stuck hard if all cpus
      want to check ICMP limits.
      
      When say a DNS or NTP server process is restarted, inetpeer tree grows
      quick and machine comes to its knees.
      
      iptables can not help because the bottleneck happens before ICMP
      messages are even cooked and sent.
      
      This patch adds a new global limitation, using a token bucket filter,
      controlled by two new sysctl :
      
      icmp_msgs_per_sec - INTEGER
          Limit maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      
      Note that if we really want to send millions of ICMP messages per
      second, we might extend idea and infra added in commit 04ca6973
      ("ip: make IP identifiers less predictable") :
      add a token bucket in the ip_idents hash and no longer rely on inetpeer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cdf507d
  3. 10 9月, 2014 1 次提交
  4. 25 8月, 2014 1 次提交
  5. 30 7月, 2014 1 次提交
  6. 29 7月, 2014 1 次提交
    • E
      ip: make IP identifiers less predictable · 04ca6973
      Eric Dumazet 提交于
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: NJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04ca6973
  7. 28 7月, 2014 1 次提交
  8. 08 7月, 2014 1 次提交
    • T
      net: Save TX flow hash in sock and set in skbuf on xmit · b73c3d0e
      Tom Herbert 提交于
      For a connected socket we can precompute the flow hash for setting
      in skb->hash on output. This is a performance advantage over
      calculating the skb->hash for every packet on the connection. The
      computation is done using the common hash algorithm to be consistent
      with computations done for packets of the connection in other states
      where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).
      
      This patch adds sk_txhash to the sock structure. inet_set_txhash and
      ip6_set_txhash functions are added which are called from points in
      TCP and UDP where socket moves to established state.
      
      skb_set_hash_from_sk is a function which sets skb->hash from the
      sock txhash value. This is called in UDP and TCP transmit path when
      transmitting within the context of a socket.
      
      Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
      interface (in this case skb_get_hash called on every TX packet to
      create a UDP source port).
      
      Before fix:
      
        95.02% CPU utilization
        154/256/505 90/95/99% latencies
        1.13042e+06 tps
      
        Time in functions:
          0.28% skb_flow_dissect
          0.21% __skb_get_hash
      
      After fix:
      
        94.95% CPU utilization
        156/254/485 90/95/99% latencies
        1.15447e+06
      
        Neither __skb_get_hash nor skb_flow_dissect appear in perf
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b73c3d0e
  9. 03 6月, 2014 1 次提交
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6
  10. 16 5月, 2014 1 次提交
  11. 15 5月, 2014 1 次提交
  12. 14 5月, 2014 1 次提交
    • L
      net: add a sysctl to reflect the fwmark on replies · e110861f
      Lorenzo Colitti 提交于
      Kernel-originated IP packets that have no user socket associated
      with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
      are emitted with a mark of zero. Add a sysctl to make them have
      the same mark as the packet they are replying to.
      
      This allows an administrator that wishes to do so to use
      mark-based routing, firewalling, etc. for these replies by
      marking the original packets inbound.
      
      Tested using user-mode linux:
       - ICMP/ICMPv6 echo replies and errors.
       - TCP RST packets (IPv4 and IPv6).
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e110861f
  13. 13 5月, 2014 1 次提交
  14. 08 5月, 2014 1 次提交
    • W
      net: clean up snmp stats code · 698365fa
      WANG Cong 提交于
      commit 8f0ea0fe (snmp: reduce percpu needs by 50%)
      reduced snmp array size to 1, so technically it doesn't have to be
      an array any more. What's more, after the following commit:
      
      	commit 933393f5
      	Date:   Thu Dec 22 11:58:51 2011 -0600
      
      	    percpu: Remove irqsafe_cpu_xxx variants
      
      	    We simply say that regular this_cpu use must be safe regardless of
      	    preemption and interrupt state.  That has no material change for x86
      	    and s390 implementations of this_cpu operations.  However, arches that
      	    do not provide their own implementation for this_cpu operations will
      	    now get code generated that disables interrupts instead of preemption.
      
      probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
      almost 3 years, no one complains.
      
      So, just convert the array to a single pointer and remove snmp_mib_init()
      and snmp_mib_free() as well.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      698365fa
  15. 06 5月, 2014 1 次提交
  16. 16 4月, 2014 2 次提交
  17. 07 3月, 2014 1 次提交
  18. 27 2月, 2014 1 次提交
    • H
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa 提交于
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b346576
  19. 20 2月, 2014 1 次提交
  20. 20 1月, 2014 1 次提交
    • H
      ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams · 4b261c75
      Hannes Frederic Sowa 提交于
      We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
      for IPv4 datagrams on IPv6 sockets.
      
      This patch splits the ip6_datagram_recv_ctl into two functions, one
      which handles both protocol families, AF_INET and AF_INET6, while the
      ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
      
      ip6_datagram_recv_*_ctl never reported back any errors, so we can make
      them return void. Also provide a helper for protocols which don't offer dual
      personality to further use ip6_datagram_recv_ctl, which is exported to
      modules.
      
      I needed to shuffle the code for ping around a bit to make it easier to
      implement dual personality for ping ipv6 sockets in future.
      Reported-by: NGert Doering <gert@space.net>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b261c75
  21. 15 1月, 2014 1 次提交
  22. 14 1月, 2014 1 次提交
    • H
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa 提交于
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f87c10a8
  23. 19 12月, 2013 1 次提交
  24. 24 11月, 2013 1 次提交
  25. 09 10月, 2013 2 次提交
    • E
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet 提交于
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe4208f
    • S
      net: ipv4 only populate IP_PKTINFO when needed · fbf8866d
      Shawn Bohrer 提交于
      The since the removal of the routing cache computing
      fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
      packet received.  This has introduced a performance regression for some
      UDP workloads.
      
      This change skips populating the packet info for sockets that do not have
      IP_PKTINFO set.
      
      Benchmark results from a netperf UDP_RR test:
      Before 89789.68 transactions/s
      After  90587.62 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.63us RTT
      After  12.48us RTT
      Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fbf8866d
  26. 01 10月, 2013 1 次提交
  27. 29 9月, 2013 2 次提交
    • F
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco 提交于
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
      Signed-off-by: NFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa661581
    • F
      ipv4: IP_TOS and IP_TTL can be specified as ancillary data · f02db315
      Francesco Fusco 提交于
      This patch enables the IP_TTL and IP_TOS values passed from userspace to
      be stored in the ipcm_cookie struct. Three fields are added to the struct:
      
      - the TTL, expressed as __u8.
        The allowed values are in the [1-255].
        A value of 0 means that the TTL is not specified.
      
      - the TOS, expressed as __s16.
        The allowed values are in the range [0,255].
        A value of -1 means that the TOS is not specified.
      
      - the priority, expressed as a char and computed when
        handling the ancillary data.
      Signed-off-by: NFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f02db315
  28. 22 9月, 2013 1 次提交
  29. 20 9月, 2013 1 次提交
    • A
      ip: generate unique IP identificator if local fragmentation is allowed · 703133de
      Ansis Atteka 提交于
      If local fragmentation is allowed, then ip_select_ident() and
      ip_select_ident_more() need to generate unique IDs to ensure
      correct defragmentation on the peer.
      
      For example, if IPsec (tunnel mode) has to encrypt large skbs
      that have local_df bit set, then all IP fragments that belonged
      to different ESP datagrams would have used the same identificator.
      If one of these IP fragments would get lost or reordered, then
      peer could possibly stitch together wrong IP fragments that did
      not belong to the same datagram. This would lead to a packet loss
      or data corruption.
      Signed-off-by: NAnsis Atteka <aatteka@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      703133de
  30. 03 9月, 2013 1 次提交
  31. 22 1月, 2013 1 次提交
  32. 27 8月, 2012 1 次提交
  33. 11 8月, 2012 1 次提交
  34. 20 7月, 2012 1 次提交
    • E
      ipv4: tcp: remove per net tcp_sock · be9f4a44
      Eric Dumazet 提交于
      tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
      per network namespace.
      
      This leads to bad behavior on multiqueue NICS, because many cpus
      contend for the socket lock and once socket lock is acquired, extra
      false sharing on various socket fields slow down the operations.
      
      To better resist to attacks, we use a percpu socket. Each cpu can
      run without contention, using appropriate memory (local node)
      
      Additional features :
      
      1) We also mirror the queue_mapping of the incoming skb, so that
      answers use the same queue if possible.
      
      2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
      
      3) We now limit the number of in-flight RST/ACK [1] packets
      per cpu, instead of per namespace, and we honor the sysctl_wmem_default
      limit dynamically. (Prior to this patch, sysctl_wmem_default value was
      copied at boot time, so any further change would not affect tcp_sock
      limit)
      
      [1] These packets are only generated when no socket was matched for
      the incoming packet.
      Reported-by: NBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be9f4a44
  35. 28 6月, 2012 1 次提交
  36. 23 6月, 2012 1 次提交
    • A
      ipv4: Add sysctl knob to control early socket demux · 6648bd7e
      Alexander Duyck 提交于
      This change is meant to add a control for disabling early socket demux.
      The main motivation behind this patch is to provide an option to disable
      the feature as it adds an additional cost to routing that reduces overall
      throughput by up to 5%.  For example one of my systems went from 12.1Mpps
      to 11.6 after the early socket demux was added.  It looks like the reason
      for the regression is that we are now having to perform two lookups, first
      the one for an established socket, and then the one for the routing table.
      
      By adding this patch and toggling the value for ip_early_demux to 0 I am
      able to get back to the 12.1Mpps I was previously seeing.
      
      [ Move local variables in ip_rcv_finish() down into the basic
        block in which they are actually used.  -DaveM ]
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6648bd7e
  37. 16 5月, 2012 1 次提交
    • P
      net: delete all instances of special processing for token ring · 211ed865
      Paul Gortmaker 提交于
      We are going to delete the Token ring support.  This removes any
      special processing in the core networking for token ring, (aside
      from net/tr.c itself), leaving the drivers and remaining tokenring
      support present but inert.
      
      The mass removal of the drivers and net/tr.c will be in a separate
      commit, so that the history of these files that we still care
      about won't have the giant deletion tied into their history.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      211ed865