1. 14 Feb 2013, 1 commit
    • tcp: adding a per-socket timestamp offset · ceaa1fef
      Andrey Vagin authored
      This functionality is used for restoring tcp sockets. A tcp timestamp
      depends on how long a system has been running, so it differs for each
      host. The solution is to set a per-socket offset.
      
      A per-socket offset for a TIME_WAIT socket is inherited from a proper
      tcp socket.
      
      tcp_request_sock doesn't have a timestamp offset, because repair mode
      is not implemented for it.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
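      A minimal sketch of the mechanism, under the assumption of simplified
      names (the real patch keeps the offset in struct tcp_sock; the struct
      and the clock helper below are illustrative, not the exact kernel code):

        /* Hedged sketch: the per-socket offset is added whenever a TSval
         * is generated, so a restored socket keeps a monotonically
         * increasing timestamp even on a host with a different uptime. */
        #include <stdint.h>

        struct tcp_sock_lite {
                uint32_t tsoffset;      /* per-socket timestamp offset */
        };

        /* stand-in for the kernel's jiffies-based clock, which restarts
         * at boot and therefore differs between hosts */
        extern uint32_t tcp_clock_ms(void);

        static uint32_t tcp_tsval(const struct tcp_sock_lite *tp)
        {
                return tcp_clock_ms() + tp->tsoffset;
        }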
  2. 11 Feb 2013, 1 commit
  3. 07 Feb 2013, 1 commit
  4. 06 Feb 2013, 2 commits
  5. 05 Feb 2013, 3 commits
  6. 04 Feb 2013, 2 commits
  7. 01 Feb 2013, 1 commit
  8. 30 Jan 2013, 6 commits
    • tcp: Increment LISTENOVERFLOW and LISTENDROPS in tcp_v4_conn_request() · 2aeef18d
      Nivedita Singhvi authored
      We drop a connection request if the accept backlog is full and there are
      sufficient packets in the syn queue to warrant starting drops. Increment
      the appropriate counters so the drops aren't silent, for accurate stats
      and to help in debugging.
      
      This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the
      counter LINUX_MIB_LISTENOVERFLOWS.
      Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
      Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
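      A hedged sketch of the counting described above, using the 3.8-era
      SNMP macros; the surrounding tcp_v4_conn_request() logic is
      abbreviated:

        /* LISTENOVERFLOWS counts accept-backlog overflows, and every
         * dropped request also hits LISTENDROPS, which keeps
         * LISTENDROPS a superset of LISTENOVERFLOWS */
        if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
                goto drop;
        }
        ...
        drop:
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
                return 0;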
    • ip_gre: When TOS is inherited, use configured TOS value for non-IP packets · 040468a0
      David Ward authored
      A GRE tunnel can be configured so that outgoing tunnel packets inherit
      the value of the TOS field from the inner IP header. With that
      configuration, when a non-IP packet is transmitted through the tunnel,
      the TOS field is always set to 0.
      
      Instead, the user should be able to configure a different TOS value as
      the fallback to use for non-IP packets. This is helpful when the non-IP
      packets are all control packets and should be handled by routers outside
      the tunnel as having Internet Control precedence. One example of this is
      the NHRP packets that control a DMVPN-compatible mGRE tunnel; they are
      encapsulated directly by GRE and do not contain an inner IP header.
      
      Under the existing behavior, the IFLA_GRE_TOS parameter must be set to
      '1' for the TOS value to be inherited. Now, only the least significant
      bit of this parameter must be set to '1', and when a non-IP packet is
      sent through the tunnel, the upper 6 bits of this same parameter will be
      copied into the TOS field. (The ECN bits get masked off as before.)
      
      This behavior is backwards-compatible with existing configurations and
      iproute2 versions.
      Signed-off-by: David Ward <david.ward@ll.mit.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
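      A hedged sketch of the resulting TOS selection on transmit; the
      variable names approximate the 3.8-era ip_gre.c, abbreviated:

        u8 tos = tunnel->parms.iph.tos;  /* value from IFLA_GRE_TOS */
        if (tos & 0x1) {                 /* LSB set: inherit */
                tos &= ~0x1;             /* upper bits = fallback TOS */
                if (skb->protocol == htons(ETH_P_IP))
                        tos = old_iph->tos;  /* copy inner IP TOS */
                /* non-IP payload: keep the configured fallback value;
                 * the ECN bits get masked off later, as before */
        }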
    • ipv4: introduce address lifetime · 5c766d64
      Jiri Pirko authored
      There are use cases where a lifetime for ipv4 addresses would be
      helpful. For example:
      1) initramfs networkmanager uses a DHCP daemon to learn network
      configuration parameters
      2) initramfs networkmanager configures addresses, routes and DNS
      3) initramfs networkmanager is requested to stop
      4) initramfs networkmanager stops all daemons including dhclient
      5) there are addresses and routes configured but no daemon running. If
      the system doesn't start networkmanager for some reason, the addresses
      and routes will be used forever, which violates RFC 2131.
      
      This patch is essentially a backport of the ipv6 address lifetime
      mechanism for ipv4 addresses.
      
      The current "ip" tool supports this without any patch (since it does
      not distinguish between ipv4 and ipv6 addresses in this respect).
      
      Also, this should be backward-compatible with all current netlink users.
      Reported-by: Pavel Šimerda <psimerda@redhat.com>
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
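      A hedged sketch of the userspace side: the lifetimes travel in the
      existing IFA_CACHEINFO netlink attribute on an RTM_NEWADDR request,
      just as they do for ipv6 (the attribute assembly around it is
      omitted):

        #include <linux/if_addr.h>

        struct ifa_cacheinfo ci = {
                .ifa_prefered = 3600,   /* preferred lifetime, seconds */
                .ifa_valid    = 7200,   /* valid lifetime, seconds */
        };
        /* attach &ci as an IFA_CACHEINFO attribute to the RTM_NEWADDR
         * message; the kernel can then expire the address once
         * ifa_valid runs out */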
    • net: frag, move LRU list maintenance outside of rwlock · 3ef0eb0d
      Jesper Dangaard Brouer authored
      Updating the fragmentation queues' LRU (Least-Recently-Used) list
      required taking the hash writer lock.  However, the LRU list isn't
      tied to the hash at all, so we can use a separate lock for it.
      Original-idea-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
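      A hedged sketch of the split; the helper and field names approximate
      the patch:

        /* the LRU list gets its own spinlock (lru_lock) instead of
         * piggy-backing on the hash rwlock; q->net is the per-netns
         * fragment state */
        static inline void inet_frag_lru_move(struct inet_frag_queue *q)
        {
                spin_lock(&q->net->lru_lock);
                list_move_tail(&q->lru_list, &q->net->lru_list);
                spin_unlock(&q->net->lru_lock);
        }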
    • net: use lib/percpu_counter API for fragmentation mem accounting · 6d7b857d
      Jesper Dangaard Brouer authored
      Replace the per network namespace shared atomic "mem" accounting
      variable in the fragmentation code with a lib/percpu_counter.
      
      Getting percpu_counter to scale to the fragmentation code's usage
      requires some tweaks.
      
      At first view, percpu_counter looks superfast, but it does not
      scale on multi-CPU/NUMA machines, because the default batch size
      is too small for the frag code's usage.  Thus, I have adjusted the
      batch size by using __percpu_counter_add() directly, instead of
      percpu_counter_sub() and percpu_counter_add().
      
      The batch size is increased to 130,000, based on the largest 64K
      fragment's memory usage.  This does introduce some imprecision into
      the memory accounting, but the accounting does not need to be strict
      for this use-case.
      
      It is also essential that the percpu_counter does not share a
      cacheline with other writers, in order to make this scale.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
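      A hedged sketch of the tweak (__percpu_counter_add() was the 3.8-era
      name; later kernels renamed it percpu_counter_add_batch()):

        /* a batch sized for the largest 64K fragment keeps each CPU on
         * its local counter instead of bouncing the shared one */
        #define frag_percpu_counter_batch 130000

        static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
        {
                __percpu_counter_add(&q->net->mem, i,
                                     frag_percpu_counter_batch);
        }

        static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
        {
                __percpu_counter_add(&q->net->mem, -i,
                                     frag_percpu_counter_batch);
        }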
    • net: frag helper functions for mem limit tracking · d433673e
      Jesper Dangaard Brouer authored
      This change is primarily a preparation to ease the extension of memory
      limit tracking.
      
      The change does reduce the number of atomic operations during freeing
      of a frag queue.  This introduces some performance improvement, as
      these atomic operations are at the core of the performance problems
      seen on NUMA systems.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
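      A hedged sketch of the helpers this change introduces; at this point
      they still wrap a plain atomic_t, so a later patch (the
      percpu_counter conversion listed above) can swap the backing store
      without touching the callers:

        static inline int frag_mem_limit(struct netns_frags *nf)
        {
                return atomic_read(&nf->mem);
        }

        static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
        {
                atomic_add(i, &q->net->mem);
        }

        static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
        {
                atomic_sub(i, &q->net->mem);
        }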
  9. 29 Jan 2013, 1 commit
  10. 28 Jan 2013, 2 commits
    • net: fix possible wrong checksum generation · cef401de
      Eric Dumazet authored
      Pravin Shelar mentioned that GSO could potentially generate a
      wrong TX checksum if the skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested linearizing skbs, but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE test is impacted by the extra allocation
      and copy, and because it uses order-0 pages, while the TCP_STREAM
      test uses bigger pages.
      Reported-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
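      A hedged sketch of how the flag is consumed; skb_has_shared_frag() is
      the helper this series adds, and the checksum-path call site is
      abbreviated:

        static inline bool skb_has_shared_frag(const struct sk_buff *skb)
        {
                return skb_is_nonlinear(skb) &&
                       (skb_shinfo(skb)->gso_type & SKB_GSO_SHARED_FRAG);
        }

        /* in a checksum path: copy the payload only when user space
         * might still rewrite the frags underneath us */
        if (skb_has_shared_frag(skb) && __skb_linearize(skb))
                goto drop;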
    • IP_GRE: Fix kernel panic in IP_GRE with GRE csum. · 5465740a
      Pravin B Shelar authored
      Due to IP_GRE GSO support, GRE can receive a non-linear skb, which
      results in a panic in the GRE_CSUM case.  The following patch fixes
      it by using the correct csum API.
      
      Bug introduced in commit 6b78f16e (gre: add GSO support)
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
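      A hedged sketch of the nature of the fix: csum_partial() walks only
      the linear area, which is wrong once GSO hands GRE a non-linear skb,
      while skb_checksum() also walks paged frags and the frag_list (the
      exact offsets at the real call site may differ):

        /* before (broken for non-linear skbs):
         *   csum = csum_partial(skb->data, skb->len, 0);
         * after: */
        *(__sum16 *)ptr = 0;
        *(__sum16 *)ptr = csum_fold(skb_checksum(skb, 0, skb->len, 0));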
  11. 27 Jan 2013, 1 commit
  12. 24 Jan 2013, 2 commits
    • soreuseport: UDP/IPv4 implementation · ba418fa3
      Tom Herbert authored
      Allow multiple UDP sockets to bind to the same port.
      
      The motivation for soreuseport would be something like a DNS server.
      An alternative would be to recv on the same socket from multiple
      threads.  As in the case of TCP, the load across these threads tends
      to be disproportionate, and we also see a lot of contention on the
      socket lock.  Note that SO_REUSEADDR already allows multiple UDP
      sockets to bind to the same port, however there is no provision to
      prevent hijacking and nothing to distribute packets across all the
      sockets sharing the same bound port.  This patch does not change the
      semantics of SO_REUSEADDR, but provides a usable version of this
      functionality for unicast.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
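      A hedged userspace sketch: each DNS-style worker opens its own UDP
      socket and sets SO_REUSEPORT before bind(); the kernel then spreads
      datagrams across the group (port 53 and the missing error handling
      are illustrative):

        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>

        #ifndef SO_REUSEPORT
        #define SO_REUSEPORT 15  /* Linux value; old headers lack it */
        #endif

        int make_worker_socket(void)
        {
                struct sockaddr_in addr;
                int one = 1;
                int fd = socket(AF_INET, SOCK_DGRAM, 0);

                /* every socket sharing the port must set this
                 * before bind() */
                setsockopt(fd, SOL_SOCKET, SO_REUSEPORT,
                           &one, sizeof(one));

                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(53);
                addr.sin_addr.s_addr = htonl(INADDR_ANY);
                bind(fd, (struct sockaddr *)&addr, sizeof(addr));
                return fd;      /* each worker recv()s on its own fd */
        }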
    • soreuseport: TCP/IPv4 implementation · da5e3630
      Tom Herbert authored
      Allow multiple listener sockets to bind to the same port.
      
      The motivation for soreuseport would be something like a web server
      binding to port 80 and running with multiple threads, where each
      thread might have its own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers, or 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener
      thread can easily become the bottleneck with a high connection
      turn-over rate.  In case #2, the proportion of connections accepted
      per thread tends to be uneven under high connection load (assuming a
      simple event loop: while (1) { accept(); process(); }); wakeup does
      not promote fairness among the sockets.  We have seen the
      disproportion be as high as a 3:1 ratio between the thread accepting
      the most connections and the one accepting the fewest.  With
      so_reuseport the distribution is uniform.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
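      A hedged sketch of the per-thread listener model this enables; the
      socket setup matches the UDP example above, and addr and process()
      are stand-ins for the application's own setup and logic:

        /* run by each worker thread: its own listener, same port 80 */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        for (;;) {
                int c = accept(fd, NULL, NULL);
                process(c);     /* connections arrive evenly per thread */
        }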
  13. 23 Jan 2013, 6 commits
  14. 22 Jan 2013, 3 commits
  15. 21 Jan 2013, 2 commits
  16. 18 Jan 2013, 1 commit
    • net: increase fragment memory usage limits · c2a93660
      Jesper Dangaard Brouer authored
      Increase the memory usage limits for incomplete IP fragments.
      
      Arguing for new thresh high/low values:
      
       High threshold = 4 MBytes
       Low  threshold = 3 MBytes
      
      The fragmentation memory accounting code tries to account for the
      real memory usage, by measuring both the size of the frag queue
      struct (inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's
      truesize.
      
      We want to be able to handle/hold-on-to enough fragments to ensure
      good performance, without letting incomplete fragments hurt
      scalability by causing the number of inet_frag_queue structs to grow
      too much (resulting in longer searches for frag queues).
      
      For IPv4, how much memory does the largest frag consume?
      
      The maximum size fragment is 64K, which is approx 44 fragments with
      MTU(1500) sized packets.  Sizeof(struct ipq) is 200.  A 1500 byte
      packet results in a truesize of 2944 (not 2048, as I first assumed):
      
        (44*2944)+200 = 129736 bytes
      
      The current default high thresh of 262144 bytes is obviously
      problematic, as only two 64K fragments can fit in the queue at the
      same time.
      
      How many 64K fragments can we fit into 4 MBytes?
      
        4*2^20/((44*2944)+200) = 32.34 fragments in queues
      
      An attacker could send separate/distinct fake fragment packets, one
      per queue, causing us to allocate one inet_frag_queue per packet, and
      thus attack the hash table and its lists.
      
      How many frag queues do we need to store, and given a current hash
      size of 64, what is the average list length?
      
      Using one MTU sized fragment per inet_frag_queue, each consuming
      (2944+200) 3144 bytes.
      
        4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length
      
      An attacker could send small fragments; the smallest packet I could
      send resulted in a truesize of 896 bytes (I'm a little surprised by
      this).
      
        4*2^20/(896+200)  = 3827 frag queues -> 59 avg list length
      
      When increasing these numbers, we also need to follow up with
      improvements that will help scalability.  Simply increasing the
      hash size is not enough, as the current implementation does not
      have per-hash-bucket locking.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
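      A small self-contained program reproducing the arithmetic above
      (values taken from the message; rounding can differ in the last
      digit from the figures quoted):

        #include <stdio.h>

        int main(void)
        {
                double high = 4.0 * (1 << 20);        /* 4 MBytes */
                double worst_64k = 44 * 2944 + 200;   /* full 64K queue */

                printf("64K frags that fit: %.2f\n", high / worst_64k);
                printf("MTU-frag queues:    %.0f\n", high / (2944 + 200));
                printf("small-frag queues:  %.0f\n", high / (896 + 200));
                return 0;
        }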
  17. 17 Jan 2013, 2 commits
  18. 12 Jan 2013, 1 commit
  19. 11 Jan 2013, 2 commits