1. 13 May 2010, 1 commit
  2. 12 May 2010, 5 commits
  3. 08 May 2010, 1 commit
  4. 02 May 2010, 2 commits
  5. 29 Apr 2010, 3 commits
    • net: ip_queue_rcv_skb() helper · f84af32c
      Committed by Eric Dumazet
      When queueing an skb to a socket, we can immediately release its dst if
      the target socket does not use IP_CMSG_PKTINFO.

      tcp_data_queue() can drop the dst too.

      This lets us benefit from a hot cache line and avoids making the
      receiver, possibly on another cpu, dirty this cache line itself.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
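      A minimal C sketch of the helper this entry describes, assuming the
      2.6.34-era inet_sk(sk)->cmsg_flags field and skb_dst_drop(); it is
      illustrative, not necessarily the exact committed code:

          /* Drop the skb's dst reference before queueing unless the socket
           * asked for IP_CMSG_PKTINFO, which needs dst data at recvmsg()
           * time. Freeing the dst here, on the softirq CPU, benefits from
           * the hot cache line instead of dirtying it on the receiver CPU. */
          static int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
          {
                  if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
                          skb_dst_drop(skb);
                  return sock_queue_rcv_skb(sk, skb);
          }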
    • net: speedup udp receive path · 4b0b72f7
      Committed by Eric Dumazet
      Since commit 95766fff ([UDP]: Add memory accounting.),
      each received packet needs one extra sock_lock()/sock_release() pair.

      This added latency because of possible backlog handling. Then, later,
      ticket spinlocks added yet another latency source in case of DDOS.

      This patch introduces the lock_sock_bh() and unlock_sock_bh()
      synchronization primitives, avoiding one atomic operation and backlog
      processing.

      skb_free_datagram_locked() uses them instead of the full-blown
      lock_sock()/release_sock() pair. The skb is orphaned inside the locked
      section for proper socket memory reclaim, and finally freed outside of
      it.

      The UDP receive path now takes the socket spinlock only once.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
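      A sketch of the primitives this entry introduces, as described above:
      only the socket spinlock is taken, with BH disabled, and no backlog
      processing happens on unlock. Treat it as illustrative:

          static inline void lock_sock_bh(struct sock *sk)
          {
                  spin_lock_bh(&sk->sk_lock.slock);
          }

          static inline void unlock_sock_bh(struct sock *sk)
          {
                  spin_unlock_bh(&sk->sk_lock.slock);
          }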
    • Revert "tcp: bind() fix when many ports are bound" · 8d238b25
      Committed by David S. Miller
      This reverts two commits:
      
      fda48a0d
      tcp: bind() fix when many ports are bound
      
      and a follow-on fix for it:
      
      6443bb1f
      ipv6: Fix inet6_csk_bind_conflict()
      
      It causes problems when binding listening sockets while time-wait
      sockets from a previous instance are still alive.
      
      It's too late to keep fiddling with this so late in the -rc
      series, and we'll deal with it in net-next-2.6 instead.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Apr 2010, 3 commits
  7. 26 Apr 2010, 3 commits
    • net: ipmr: add support for dumping routing tables over netlink · cb6a4e46
      Committed by Patrick McHardy
      The ipmr /proc interface (ip_mr_cache) can't be extended to dump routes
      from any table but the main one in a backwards-compatible fashion, since
      the output format ends in a variable number of output interfaces.
      
      Introduce a new netlink interface to dump multicast routes from all tables,
      similar to the netlink interface for regular routes.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
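      A hedged userspace sketch of how such a dump could be requested over
      rtnetlink, using the RTNL_FAMILY_IPMR value added by the companion
      commit below; the helper name is hypothetical and error handling is
      trimmed:

          #include <linux/rtnetlink.h>
          #include <string.h>
          #include <sys/socket.h>

          /* fd is a socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) socket. */
          static int request_ipmr_dump(int fd)
          {
                  struct {
                          struct nlmsghdr nlh;
                          struct rtmsg rtm;
                  } req;

                  memset(&req, 0, sizeof(req));
                  req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
                  req.nlh.nlmsg_type = RTM_GETROUTE;
                  req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
                  req.rtm.rtm_family = RTNL_FAMILY_IPMR; /* pseudo-family */

                  return send(fd, &req, req.nlh.nlmsg_len, 0);
          }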
    • net: rtnetlink: decouple rtnetlink address families from real address families · 25239cee
      Committed by Patrick McHardy
      Decouple rtnetlink address families from the real address families in
      socket.h, to be able to add rtnetlink interfaces to code that is not a
      real address family without increasing AF_MAX/NPROTO.

      This will be used to add support for multicast route dumping from all
      tables, as the proc interface can't be extended to support anything but
      the main table without breaking compatibility.

      This partially undoes the patch that introduced independent families for
      routing rules, and converts ipmr routing rules to a new rtnetlink family.
      As in that patch, values up to 127 are reserved for real address
      families; values above that may be used arbitrarily.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
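      A sketch of the resulting layout, with the pseudo-family values as I
      recall them from linux/rtnetlink.h of that era (treat the exact numbers
      as illustrative):

          /* Values 0..127 are the real address families from socket.h;
           * rtnetlink-only pseudo-families start above them. */
          #define RTNL_FAMILY_IPMR        128
          #define RTNL_FAMILY_IP6MR       129
          #define RTNL_FAMILY_MAX         129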
    • net: fib_rules: mark arguments to fib_rules_register const and __net_initdata · 3d0c9c4e
      Committed by Patrick McHardy
      fib_rules_register() duplicates the template passed to it without
      modification, so the argument can be marked const. Additionally, the
      templates are only needed when instantiating a new namespace, so mark
      them __net_initdata, which means they can be discarded when
      CONFIG_NET_NS=n.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
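      A sketch of the usage pattern this enables; the template name and
      fields are hypothetical, and the fib_rules_register() signature is
      assumed to be the post-patch one taking a const template plus a
      namespace:

          /* Only read while a namespace is instantiated, so the template can
           * be const and discarded after init when CONFIG_NET_NS=n. */
          static const struct fib_rules_ops __net_initdata my_rules_template = {
                  .family    = AF_INET,   /* illustrative fields only */
                  .rule_size = sizeof(struct fib_rule),
          };

          static int __net_init my_rules_net_init(struct net *net)
          {
                  struct fib_rules_ops *ops;

                  ops = fib_rules_register(&my_rules_template, net);
                  if (IS_ERR(ops))
                          return PTR_ERR(ops);
                  return 0;
          }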
  8. 23 Apr 2010, 3 commits
  9. 22 Apr 2010, 1 commit
  10. 21 Apr 2010, 2 commits
  11. 20 Apr 2010, 1 commit
  12. 19 Apr 2010, 4 commits
  13. 17 Apr 2010, 1 commit
    • rfs: Receive Flow Steering · fec5e652
      Committed by Tom Herbert
      This patch implements receive flow steering (RFS).  RFS steers
      received packets for layer 3 and 4 processing to the CPU where
      the application for the corresponding flow is running.  RFS is an
      extension of Receive Packet Steering (RPS).
      
      The basic idea of RFS is that when an application calls recvmsg
      (or sendmsg), the application's running CPU is stored in a hash
      table indexed by the connection's rxhash, which is stored in the
      socket structure.  The rxhash is carried in skbs received on the
      connection from netif_receive_skb.  For each received packet,
      the associated rxhash is used to look up the CPU in the hash table;
      if a valid CPU is set, the packet is steered to that CPU using
      the RPS mechanisms.
      
      The complication with this simple approach is that it would
      potentially allow OOO packets.  If threads are thrashing around CPUs
      or multiple threads are trying to read from the same sockets, a quickly
      changing CPU value in the hash table could cause rampant OOO packets;
      we consider this a non-starter.
      
      To avoid OOO packets, this solution implements two types of hash
      tables: rps_sock_flow_table and rps_dev_flow_table.
      
      rps_sock_flow_table is a global hash table.  Each entry is just a CPU
      number and it is populated in recvmsg and sendmsg as described above.
      This table contains the "desired" CPUs for flows.
      
      rps_dev_flow_table is specific to each device queue.  Each entry
      contains a CPU and a tail queue counter.  The CPU is the "current"
      CPU for a matching flow.  The tail queue counter holds the value
      of a tail queue counter for the associated CPU's backlog queue at
      the time of last enqueue for a flow matching the entry.
      
      Each backlog queue has a queue head counter which is incremented
      on dequeue, and so a queue tail counter is computed as queue head
      count + queue length.  When a packet is enqueued on a backlog queue,
      the current value of the queue tail counter is saved in the hash
      entry of the rps_dev_flow_table.
      
      And now the trick: when selecting the CPU for RPS (get_rps_cpu)
      the rps_sock_flow table and the rps_dev_flow table for the RX queue
      are consulted.  When the desired CPU for the flow (found in the
      rps_sock_flow table) does not match the current CPU (found in the
      rps_dev_flow table), the current CPU is changed to the desired CPU
      if one of the following is true:
      
      - The current CPU is unset (equal to RPS_NO_CPU)
      - Current CPU is offline
      - The current CPU's queue head counter >= queue tail counter in the
      rps_dev_flow table.  This checks if the queue tail has advanced
      beyond the last packet that was enqueued using this table entry.
      This guarantees that all packets queued using this entry have been
      dequeued, thus preserving in order delivery.
      
      Making each queue have its own rps_dev_flow table has two advantages:
      1) the tail queue counters will be written on each receive, so
      keeping the table local to the interrupting CPU is good for locality;
      2) it allows lockless access to the table: the CPU number and queue
      tail counter need to be accessed together under mutual exclusion
      from netif_receive_skb, and we assume this is only called from the
      device's napi_poll, which is non-reentrant.
      
      This patch implements RFS for TCP and connected UDP sockets.
      It should be usable for other flow oriented protocols.
      
      There are two configuration parameters for RFS.  The
      "rps_flow_entries" kernel init parameter sets the number of
      entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
      "rps_flow_cnt" contains the number of entries in the rps_dev_flow
      table for that rxqueue.  Both are rounded up to a power of two.
      
      The obvious benefit of RFS (over just RPS) is that it achieves
      CPU locality between the receive processing for a flow and the
      application's processing; this can result in increased performance
      (higher pps, lower latency).
      
      The benefits of RFS are dependent on cache hierarchy, application
      load, and other factors.  On simple benchmarks, we don't necessarily
      see improvement and sometimes see degradation.  However, for more
      complex benchmarks and for applications where cache pressure is
      much higher this technique seems to perform very well.
      
      Below are some benchmark results which show the potential benefit of
      this patch.  The netperf test has 500 instances of the netperf TCP_RR
      test with 1-byte requests and responses.  The RPC test is a
      request/response test similar in structure to the netperf RR test,
      with 100 threads on each host, but doing more work in userspace than
      netperf.
      
      e1000e on 8-core Intel:
         No RFS or RPS               104K tps at 30% CPU
         No RFS (best RPS config)    290K tps at 63% CPU
         RFS                         303K tps at 61% CPU

      RPC test      tps     CPU%    50/90/99% usec latency    Latency StdDev
        No RFS/RPS  103K    48%     757/900/3185              4472.35
        RPS only    174K    73%     415/993/2468              491.66
        RFS         223K    73%     379/651/1382              315.61
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
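      A hedged C sketch of the CPU-switch rule described above. The structure
      and field names (rps_dev_flow, last_qtail, input_queue_head) follow the
      commit text; treat the details as illustrative rather than the exact
      get_rps_cpu() code:

          /* Decide which CPU should process a packet of this flow. Move from
           * the "current" CPU to the "desired" one only when no packet
           * enqueued via this table entry can still be waiting in the old
           * CPU's backlog. */
          static u16 rfs_pick_cpu(u16 desired_cpu, struct rps_dev_flow *rflow)
          {
                  u16 tcpu = rflow->cpu; /* "current" CPU for this flow */

                  if (desired_cpu != tcpu &&
                      (tcpu == RPS_NO_CPU ||
                       !cpu_online(tcpu) ||
                       /* signed compare handles wraparound: true once the head
                        * counter passes the tail saved at the last enqueue */
                       (int)(per_cpu(softnet_data, tcpu).input_queue_head -
                             rflow->last_qtail) >= 0))
                          tcpu = rflow->cpu = desired_cpu;

                  return tcpu;
          }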
  14. 16 Apr 2010, 3 commits
    • net: replace ipfragok with skb->local_df · 4e15ed4d
      Committed by Shan Wei
      As Herbert Xu said, we should be able to simply replace ipfragok
      with skb->local_df. Commit f88037 (sctp: Drop ipfargok in sctp_xmit
      function) already dropped ipfragok and sets the local_df value
      properly.

      This patch kills the ipfragok parameter of .queue_xmit().
      Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
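      A hedged before/after sketch of what the change means for callers,
      using the TCP output path's icsk->icsk_af_ops->queue_xmit hook as an
      example; illustrative only:

          /* before: fragmentation policy passed as a flag on every call */
          err = icsk->icsk_af_ops->queue_xmit(skb, 0 /* ipfragok */);

          /* after: the flag is gone; callers that allowed fragmentation set
           * the skb field once, before transmission */
          skb->local_df = 1;
          err = icsk->icsk_af_ops->queue_xmit(skb);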
    • ip: Fix ip_dev_loopback_xmit() · e30b38c2
      Committed by Eric Dumazet
      Eric Paris got the following trace with a linux-next kernel:
      
      [   14.203970] BUG: using smp_processor_id() in preemptible [00000000]
      code: avahi-daemon/2093
      [   14.204025] caller is netif_rx+0xfa/0x110
      [   14.204035] Call Trace:
      [   14.204064]  [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
      [   14.204070]  [<ffffffff8142163a>] netif_rx+0xfa/0x110
      [   14.204090]  [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
      [   14.204095]  [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
      [   14.204099]  [<ffffffff8145d610>] ip_local_out+0x20/0x30
      [   14.204105]  [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
      [   14.204119]  [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
      [   14.204125]  [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
      [   14.204137]  [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
      [   14.204149]  [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
      [   14.204189]  [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
      [   14.204233]  [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b
      
      While the current linux-2.6 kernel doesn't emit this warning, the bug
      is latent and might cause unexpected failures.

      ip_dev_loopback_xmit() runs in process context with preemption enabled,
      so it must call netif_rx_ni() instead of netif_rx() to make sure that
      pending software interrupts get processed.

      The same change applies to ip6_dev_loopback_xmit().
      Reported-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
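      A sketch of the fixed function, close to the 2.6.34-era code but
      reproduced from memory, so treat the details as illustrative:

          static int ip_dev_loopback_xmit(struct sk_buff *newskb)
          {
                  skb_reset_mac_header(newskb);
                  __skb_pull(newskb, skb_network_offset(newskb));
                  newskb->pkt_type = PACKET_LOOPBACK;
                  newskb->ip_summed = CHECKSUM_UNNECESSARY;
                  WARN_ON(!skb_dst(newskb));

                  /* Process context, preemption enabled: netif_rx_ni() also
                   * runs any softirq raised by the enqueue, which a plain
                   * netif_rx() would leave pending. */
                  netif_rx_ni(newskb);
                  return 0;
          }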
    • netfilter: ipt_LOG/ip6t_LOG: use more appropriate log level as default · f0d57a54
      Committed by Patrick McHardy
      Use KERN_NOTICE instead of KERN_EMERG by default. This only affects
      kernel-internal logging (like conntrack); user-specified logging rules
      carry a separate log level.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
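      A sketch of what the new default plausibly looks like in the LOG
      target, assuming the standard nf_loginfo structure; the numeric levels
      follow the kernel's printk convention (0 = KERN_EMERG, 5 = KERN_NOTICE):

          /* Default used for kernel-internal logging when no user rule
           * supplies its own level. */
          static struct nf_loginfo default_loginfo = {
                  .type = NF_LOG_TYPE_LOG,
                  .u = {
                          .log = {
                                  .level    = 5, /* was 0 (KERN_EMERG) */
                                  .logflags = NF_LOG_MASK,
                          },
                  },
          };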
  15. 15 Apr 2010, 4 commits
  16. 14 Apr 2010, 3 commits