1. 31 8月, 2015 1 次提交
  2. 29 8月, 2015 1 次提交
  3. 30 7月, 2015 1 次提交
  4. 16 7月, 2015 1 次提交
  5. 05 6月, 2015 2 次提交
    • T
      net: Add full IPv6 addresses to flow_keys · c3f83241
      Tom Herbert 提交于
      This patch adds full IPv6 addresses into flow_keys and uses them as
      input to the flow hash function. The implementation supports either
      IPv4 or IPv6 addresses in a union, and selector is used to determine
      how may words to input to jhash2.
      
      We also add flow_get_u32_dst and flow_get_u32_src functions which are
      used to get a u32 representation of the source and destination
      addresses. For IPv6, ipv6_addr_hash is called. These functions retain
      getting the legacy values of src and dst in flow_keys.
      
      With this patch, Ethertype and IP protocol are now included in the
      flow hash input.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f83241
    • T
      net: Get skb hash over flow_keys structure · 42aecaa9
      Tom Herbert 提交于
      This patch changes flow hashing to use jhash2 over the flow_keys
      structure instead just doing jhash_3words over src, dst, and ports.
      This method will allow us take more input into the hashing function
      so that we can include full IPv6 addresses, VLAN, flow labels etc.
      without needing to resort to xor'ing which makes for a poor hash.
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42aecaa9
  6. 28 5月, 2015 1 次提交
    • F
      ip_fragment: don't forward defragmented DF packet · d6b915e2
      Florian Westphal 提交于
      We currently always send fragments without DF bit set.
      
      Thus, given following setup:
      
      mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
         A           R1              R2         B
      
      Where R1 and R2 run linux with netfilter defragmentation/conntrack
      enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
      will respond with icmp too big error if one of these fragments exceeded
      1400 bytes.
      
      However, if R1 receives fragment sizes 1200 and 100, it would
      forward the reassembled packet without refragmenting, i.e.
      R2 will send an icmp error in response to a packet that was never sent,
      citing mtu that the original sender never exceeded.
      
      The other minor issue is that a refragmentation on R1 will conceal the
      MTU of R2-B since refragmentation does not set DF bit on the fragments.
      
      This modifies ip_fragment so that we track largest fragment size seen
      both for DF and non-DF packets, and set frag_max_size to the largest
      value.
      
      If the DF fragment size is larger or equal to the non-df one, we will
      consider the packet a path mtu probe:
      We set DF bit on the reassembled skb and also tag it with a new IPCB flag
      to force refragmentation even if skb fits outdev mtu.
      
      We will also set DF bit on each fragment in this case.
      
      Joint work with Hannes Frederic Sowa.
      Reported-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6b915e2
  7. 20 5月, 2015 1 次提交
  8. 19 5月, 2015 2 次提交
  9. 14 5月, 2015 3 次提交
  10. 08 4月, 2015 1 次提交
    • D
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller 提交于
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one the encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7026b1dd
  11. 07 4月, 2015 1 次提交
    • H
      ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org 提交于
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket ontop of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f60e5990
  12. 26 3月, 2015 1 次提交
  13. 02 2月, 2015 1 次提交
    • E
      ipv4: tcp: get rid of ugly unicast_sock · bdbbb852
      Eric Dumazet 提交于
      In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
      I tried to address contention on a socket lock, but the solution
      I chose was horrible :
      
      commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
      of TCP stack") addressed a selinux regression.
      
      commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
      took care of another regression.
      
      commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.
      
      commit 811230cd ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
      was another shot in the dark.
      
      Really, just use a proper socket per cpu, and remove the skb_orphan()
      call, to re-enable flow control.
      
      This solves a serious problem with FQ packet scheduler when used in
      hostile environments, as we do not want to allocate a flow structure
      for every RST packet sent in response to a spoofed packet.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdbbb852
  14. 27 1月, 2015 1 次提交
  15. 06 1月, 2015 1 次提交
  16. 29 9月, 2014 1 次提交
  17. 24 9月, 2014 1 次提交
    • E
      icmp: add a global rate limitation · 4cdf507d
      Eric Dumazet 提交于
      Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
      protected by a lock, meaning that hosts can be stuck hard if all cpus
      want to check ICMP limits.
      
      When say a DNS or NTP server process is restarted, inetpeer tree grows
      quick and machine comes to its knees.
      
      iptables can not help because the bottleneck happens before ICMP
      messages are even cooked and sent.
      
      This patch adds a new global limitation, using a token bucket filter,
      controlled by two new sysctl :
      
      icmp_msgs_per_sec - INTEGER
          Limit maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      
      Note that if we really want to send millions of ICMP messages per
      second, we might extend idea and infra added in commit 04ca6973
      ("ip: make IP identifiers less predictable") :
      add a token bucket in the ip_idents hash and no longer rely on inetpeer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cdf507d
  18. 10 9月, 2014 1 次提交
  19. 25 8月, 2014 1 次提交
  20. 30 7月, 2014 1 次提交
  21. 29 7月, 2014 1 次提交
    • E
      ip: make IP identifiers less predictable · 04ca6973
      Eric Dumazet 提交于
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: NJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04ca6973
  22. 28 7月, 2014 1 次提交
  23. 08 7月, 2014 1 次提交
    • T
      net: Save TX flow hash in sock and set in skbuf on xmit · b73c3d0e
      Tom Herbert 提交于
      For a connected socket we can precompute the flow hash for setting
      in skb->hash on output. This is a performance advantage over
      calculating the skb->hash for every packet on the connection. The
      computation is done using the common hash algorithm to be consistent
      with computations done for packets of the connection in other states
      where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).
      
      This patch adds sk_txhash to the sock structure. inet_set_txhash and
      ip6_set_txhash functions are added which are called from points in
      TCP and UDP where socket moves to established state.
      
      skb_set_hash_from_sk is a function which sets skb->hash from the
      sock txhash value. This is called in UDP and TCP transmit path when
      transmitting within the context of a socket.
      
      Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
      interface (in this case skb_get_hash called on every TX packet to
      create a UDP source port).
      
      Before fix:
      
        95.02% CPU utilization
        154/256/505 90/95/99% latencies
        1.13042e+06 tps
      
        Time in functions:
          0.28% skb_flow_dissect
          0.21% __skb_get_hash
      
      After fix:
      
        94.95% CPU utilization
        156/254/485 90/95/99% latencies
        1.15447e+06
      
        Neither __skb_get_hash nor skb_flow_dissect appear in perf
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b73c3d0e
  24. 03 6月, 2014 1 次提交
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6
  25. 16 5月, 2014 1 次提交
  26. 15 5月, 2014 1 次提交
  27. 14 5月, 2014 1 次提交
    • L
      net: add a sysctl to reflect the fwmark on replies · e110861f
      Lorenzo Colitti 提交于
      Kernel-originated IP packets that have no user socket associated
      with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
      are emitted with a mark of zero. Add a sysctl to make them have
      the same mark as the packet they are replying to.
      
      This allows an administrator that wishes to do so to use
      mark-based routing, firewalling, etc. for these replies by
      marking the original packets inbound.
      
      Tested using user-mode linux:
       - ICMP/ICMPv6 echo replies and errors.
       - TCP RST packets (IPv4 and IPv6).
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e110861f
  28. 13 5月, 2014 1 次提交
  29. 08 5月, 2014 1 次提交
    • W
      net: clean up snmp stats code · 698365fa
      WANG Cong 提交于
      commit 8f0ea0fe (snmp: reduce percpu needs by 50%)
      reduced snmp array size to 1, so technically it doesn't have to be
      an array any more. What's more, after the following commit:
      
      	commit 933393f5
      	Date:   Thu Dec 22 11:58:51 2011 -0600
      
      	    percpu: Remove irqsafe_cpu_xxx variants
      
      	    We simply say that regular this_cpu use must be safe regardless of
      	    preemption and interrupt state.  That has no material change for x86
      	    and s390 implementations of this_cpu operations.  However, arches that
      	    do not provide their own implementation for this_cpu operations will
      	    now get code generated that disables interrupts instead of preemption.
      
      probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
      almost 3 years, no one complains.
      
      So, just convert the array to a single pointer and remove snmp_mib_init()
      and snmp_mib_free() as well.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      698365fa
  30. 06 5月, 2014 1 次提交
  31. 16 4月, 2014 2 次提交
  32. 07 3月, 2014 1 次提交
  33. 27 2月, 2014 1 次提交
    • H
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa 提交于
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b346576
  34. 20 2月, 2014 1 次提交
  35. 20 1月, 2014 1 次提交
    • H
      ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams · 4b261c75
      Hannes Frederic Sowa 提交于
      We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
      for IPv4 datagrams on IPv6 sockets.
      
      This patch splits the ip6_datagram_recv_ctl into two functions, one
      which handles both protocol families, AF_INET and AF_INET6, while the
      ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
      
      ip6_datagram_recv_*_ctl never reported back any errors, so we can make
      them return void. Also provide a helper for protocols which don't offer dual
      personality to further use ip6_datagram_recv_ctl, which is exported to
      modules.
      
      I needed to shuffle the code for ping around a bit to make it easier to
      implement dual personality for ping ipv6 sockets in future.
      Reported-by: NGert Doering <gert@space.net>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b261c75