1. 08 8月, 2011 1 次提交
  2. 07 5月, 2011 1 次提交
  3. 29 4月, 2011 1 次提交
    • E
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet 提交于
      We lack proper synchronization to manipulate inet->opt ip_options
      
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      
      Another thread can change inet->opt pointer and free old one under us.
      
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6d8bd05
  4. 02 3月, 2011 1 次提交
  5. 28 1月, 2011 1 次提交
  6. 10 12月, 2010 1 次提交
    • E
      net: optimize INET input path further · 68835aba
      Eric Dumazet 提交于
      Followup of commit b178bb3d (net: reorder struct sock fields)
      
      Optimize INET input path a bit further, by :
      
      1) moving sk_refcnt close to sk_lock.
      
      This reduces number of dirtied cache lines by one on 64bit arches (and
      64 bytes cache line size).
      
      2) moving inet_daddr & inet_rcv_saddr at the beginning of sk
      
      (same cache line than hash / family / bound_dev_if / nulls_node)
      
      This reduces number of accessed cache lines in lookups by one, and dont
      increase size of inet and timewait socks.
      inet and tw sockets now share same place-holder for these fields.
      
      Before patch :
      
      offsetof(struct sock, sk_refcnt) = 0x10
      offsetof(struct sock, sk_lock) = 0x40
      offsetof(struct sock, sk_receive_queue) = 0x60
      offsetof(struct inet_sock, inet_daddr) = 0x270
      offsetof(struct inet_sock, inet_rcv_saddr) = 0x274
      
      After patch :
      
      offsetof(struct sock, sk_refcnt) = 0x44
      offsetof(struct sock, sk_lock) = 0x48
      offsetof(struct sock, sk_receive_queue) = 0x68
      offsetof(struct inet_sock, inet_daddr) = 0x0
      offsetof(struct inet_sock, inet_rcv_saddr) = 0x4
      
      compute_score() (udp or tcp) now use a single cache line per ignored
      item, instead of two.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68835aba
  7. 13 11月, 2010 1 次提交
  8. 24 6月, 2010 1 次提交
  9. 28 4月, 2010 2 次提交
  10. 17 4月, 2010 1 次提交
    • T
      rfs: Receive Flow Steering · fec5e652
      Tom Herbert 提交于
      This patch implements receive flow steering (RFS).  RFS steers
      received packets for layer 3 and 4 processing to the CPU where
      the application for the corresponding flow is running.  RFS is an
      extension of Receive Packet Steering (RPS).
      
      The basic idea of RFS is that when an application calls recvmsg
      (or sendmsg) the application's running CPU is stored in a hash
      table that is indexed by the connection's rxhash which is stored in
      the socket structure.  The rxhash is passed in skb's received on
      the connection from netif_receive_skb.  For each received packet,
      the associated rxhash is used to look up the CPU in the hash table,
      if a valid CPU is set then the packet is steered to that CPU using
      the RPS mechanisms.
      
      The convolution of the simple approach is that it would potentially
      allow OOO packets.  If threads are thrashing around CPUs or multiple
      threads are trying to read from the same sockets, a quickly changing
      CPU value in the hash table could cause rampant OOO packets--
      we consider this a non-starter.
      
      To avoid OOO packets, this solution implements two types of hash
      tables: rps_sock_flow_table and rps_dev_flow_table.
      
      rps_sock_table is a global hash table.  Each entry is just a CPU
      number and it is populated in recvmsg and sendmsg as described above.
      This table contains the "desired" CPUs for flows.
      
      rps_dev_flow_table is specific to each device queue.  Each entry
      contains a CPU and a tail queue counter.  The CPU is the "current"
      CPU for a matching flow.  The tail queue counter holds the value
      of a tail queue counter for the associated CPU's backlog queue at
      the time of last enqueue for a flow matching the entry.
      
      Each backlog queue has a queue head counter which is incremented
      on dequeue, and so a queue tail counter is computed as queue head
      count + queue length.  When a packet is enqueued on a backlog queue,
      the current value of the queue tail counter is saved in the hash
      entry of the rps_dev_flow_table.
      
      And now the trick: when selecting the CPU for RPS (get_rps_cpu)
      the rps_sock_flow table and the rps_dev_flow table for the RX queue
      are consulted.  When the desired CPU for the flow (found in the
      rps_sock_flow table) does not match the current CPU (found in the
      rps_dev_flow table), the current CPU is changed to the desired CPU
      if one of the following is true:
      
      - The current CPU is unset (equal to RPS_NO_CPU)
      - Current CPU is offline
      - The current CPU's queue head counter >= queue tail counter in the
      rps_dev_flow table.  This checks if the queue tail has advanced
      beyond the last packet that was enqueued using this table entry.
      This guarantees that all packets queued using this entry have been
      dequeued, thus preserving in order delivery.
      
      Making each queue have its own rps_dev_flow table has two advantages:
      1) the tail queue counters will be written on each receive, so
      keeping the table local to interrupting CPU s good for locality.  2)
      this allows lockless access to the table-- the CPU number and queue
      tail counter need to be accessed together under mutual exclusion
      from netif_receive_skb, we assume that this is only called from
      device napi_poll which is non-reentrant.
      
      This patch implements RFS for TCP and connected UDP sockets.
      It should be usable for other flow oriented protocols.
      
      There are two configuration parameters for RFS.  The
      "rps_flow_entries" kernel init parameter sets the number of
      entries in the rps_sock_flow_table, the per rxqueue sysfs entry
      "rps_flow_cnt" contains the number of entries in the rps_dev_flow
      table for the rxqueue.  Both are rounded to power of two.
      
      The obvious benefit of RFS (over just RPS) is that it achieves
      CPU locality between the receive processing for a flow and the
      applications processing; this can result in increased performance
      (higher pps, lower latency).
      
      The benefits of RFS are dependent on cache hierarchy, application
      load, and other factors.  On simple benchmarks, we don't necessarily
      see improvement and sometimes see degradation.  However, for more
      complex benchmarks and for applications where cache pressure is
      much higher this technique seems to perform very well.
      
      Below are some benchmark results which show the potential benfit of
      this patch.  The netperf test has 500 instances of netperf TCP_RR
      test with 1 byte req. and resp.  The RPC test is an request/response
      test similar in structure to netperf RR test ith 100 threads on
      each host, but does more work in userspace that netperf.
      
      e1000e on 8 core Intel
         No RFS or RPS		104K tps at 30% CPU
         No RFS (best RPS config):    290K tps at 63% CPU
         RFS				303K tps at 61% CPU
      
      RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
        No RFS/RPS	103K	48%	757/900/3185		4472.35
        RPS only:	174K	73%	415/993/2468		491.66
        RFS		223K	73%	379/651/1382		315.61
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fec5e652
  11. 12 1月, 2010 1 次提交
  12. 19 10月, 2009 1 次提交
    • E
      inet: rename some inet_sock fields · c720c7e8
      Eric Dumazet 提交于
      In order to have better cache layouts of struct sock (separate zones
      for rx/tx paths), we need this preliminary patch.
      
      Goal is to transfert fields used at lookup time in the first
      read-mostly cache line (inside struct sock_common) and move sk_refcnt
      to a separate cache line (only written by rx path)
      
      This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
      sport and id fields. This allows a future patch to define these
      fields as macros, like sk_refcnt, without name clashes.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c720c7e8
  13. 15 6月, 2009 1 次提交
  14. 02 6月, 2009 1 次提交
    • N
      ipv4: New multicast-all socket option · f771bef9
      Nivedita Singhvi 提交于
      After some discussion offline with Christoph Lameter and David Stevens
      regarding multicast behaviour in Linux, I'm submitting a slightly
      modified patch from the one Christoph submitted earlier.
      
      This patch provides a new socket option IP_MULTICAST_ALL.
      
      In this case, default behaviour is _unchanged_ from the current
      Linux standard. The socket option is set by default to provide
      original behaviour. Sockets wishing to receive data only from
      multicast groups they join explicitly will need to clear this
      socket option.
      Signed-off-by: NNivedita Singhvi <niv@us.ibm.com>
      Signed-off-by: Christoph Lameter<cl@linux.com>
      Acked-by: NDavid Stevens <dlstevens@us.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f771bef9
  15. 01 10月, 2008 4 次提交
  16. 17 6月, 2008 2 次提交
  17. 11 6月, 2008 1 次提交
  18. 25 3月, 2008 1 次提交
  19. 23 3月, 2008 1 次提交
  20. 06 3月, 2008 1 次提交
  21. 05 3月, 2008 1 次提交
    • D
      [TCP]: Improve ipv4 established hash function. · 7adc3830
      David S. Miller 提交于
      If all of the entropy is in the local and foreign addresses,
      but xor'ing together would cancel out that entropy, the
      current hash performs poorly.
      
      Suggested by Cosmin Ratiu:
      
      	Basically, the situation is as follows: There is a client
      	machine and a server machine. Both create 15000 virtual
      	interfaces, open up a socket for each pair of interfaces and
      	do SIP traffic. By profiling I noticed that there is a lot of
      	time spent walking the established hash chains with this
      	particular setup.
      
      	The addresses were distributed like this: client interfaces
      	were 198.18.0.1/16 with increments of 1 and server interfaces
      	were 198.18.128.1/16 with increments of 1. As I said, there
      	were 15000 interfaces. Source and destination ports were 5060
      	for each connection.  So in this case, ports don't matter for
      	hashing purposes, and the bits from the address pairs used
      	cancel each other, meaning there are no differences in the
      	whole lot of pairs, so they all end up in the same hash chain.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7adc3830
  22. 26 10月, 2007 1 次提交
  23. 26 4月, 2007 1 次提交
  24. 29 9月, 2006 6 次提交
  25. 23 9月, 2006 2 次提交
  26. 26 4月, 2006 1 次提交
  27. 04 1月, 2006 1 次提交