1. 19 Oct, 2009 1 commit
    • inet: rename some inet_sock fields · c720c7e8
      Eric Dumazet committed
      In order to have better cache layouts of struct sock (separate zones
      for rx/tx paths), we need this preliminary patch.
      
      The goal is to transfer the fields used at lookup time into the first
      read-mostly cache line (inside struct sock_common) and move sk_refcnt
      to a separate cache line (only written by the rx path).
      
      This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
      sport and id fields. This allows a future patch to define these
      fields as macros, like sk_refcnt, without name clashes.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
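The naming argument in this commit can be illustrated with a minimal sketch. It assumes simplified stand-in structures: the `mini_` types and the `skc_daddr`/`skc_dport` member names are illustrative, not the kernel's actual layout. Once a field carries the `inet_` prefix, a later patch can turn it into a macro that aliases a member of the common header without colliding with ordinary identifiers such as a parameter named `daddr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the real kernel structures are much larger. */
struct mini_sock_common {
    uint32_t skc_daddr;   /* hypothetical: lookup key kept in the read-mostly line */
    uint16_t skc_dport;
};

struct mini_sock {
    struct mini_sock_common sk_common;
};

struct mini_inet_sock {
    struct mini_sock sk;
};

/* Because the names are prefixed, defining them as macros is safe ... */
#define inet_daddr sk.sk_common.skc_daddr
#define inet_dport sk.sk_common.skc_dport

/* ... while unprefixed identifiers such as 'daddr' keep working untouched. */
static int mini_match(const struct mini_inet_sock *inet,
                      uint32_t daddr, uint16_t dport)
{
    assert(inet != NULL);
    return inet->inet_daddr == daddr && inet->inet_dport == dport;
}
```

Had the macro been named plain `daddr`, the `daddr` parameter of `mini_match()` would have been rewritten by the preprocessor as well, which is the kind of clash the prefix avoids.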
  2. 02 Oct, 2009 1 commit
  3. 03 Sep, 2009 1 commit
    • ip: Report qdisc packet drops · 6ce9e7b5
      Eric Dumazet committed
      Christoph Lameter pointed out that packet drops at the qdisc level were not
      accounted for in SNMP counters. Only if the application sets IP_RECVERR are
      drops reported to the user (-ENOBUFS errors) and SNMP counters updated.
      
      IP_RECVERR is used to enable extended reliable error message passing,
      but these are not needed to update system wide SNMP stats.
      
      This patch changes things a bit to allow SNMP counters to be updated,
      regardless of IP_RECVERR being set or not on the socket.
      
      Example after a UDP tx flood:
      # netstat -s 
      ...
      IP:
          1487048 outgoing packets dropped
      ...
      Udp:
      ...
          SndbufErrors: 1487048
      
      
      send() syscalls do, however, still return an OK status, so as not to
      break applications.
      
      Note: the send() manual page explicitly says for the ENOBUFS error:
      
       "The output queue for a network interface was full.
        This generally indicates that the interface has stopped sending,
        but may be caused by transient congestion.
        (Normally, this does not occur in Linux. Packets are just silently
        dropped when a device queue overflows.) "
      
      This is not true for IP_RECVERR-enabled sockets: a send() syscall
      that hits a qdisc drop returns an ENOBUFS error.
      
      Many thanks to Christoph, David, and last but not least, Alexey!
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
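The behaviour described in this commit message can be summarised as a small decision model. This is a sketch only, not the kernel code: `mini_stats` and `mini_udp_send_result()` are hypothetical names, and the two counters simply stand in for the "outgoing packets dropped" and "SndbufErrors" lines of the netstat output quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters standing in for the SNMP statistics. */
struct mini_stats {
    unsigned long ip_out_discards;   /* "outgoing packets dropped" */
    unsigned long udp_sndbuf_errors; /* "SndbufErrors" */
};

/*
 * Model of the outcome of one UDP send.
 *   qdisc_dropped: the packet was dropped at the qdisc level
 *   recverr:       the socket has IP_RECVERR set
 * Returns 0 for "OK" or -ENOBUFS.
 */
static int mini_udp_send_result(struct mini_stats *st,
                                bool qdisc_dropped, bool recverr)
{
    assert(st != NULL);
    if (!qdisc_dropped)
        return 0;

    /* Counters are updated whether or not IP_RECVERR is set. */
    st->ip_out_discards++;
    st->udp_sndbuf_errors++;

    /* Only IP_RECVERR sockets see the error; others still get OK. */
    return recverr ? -ENOBUFS : 0;
}
```

The model keeps the two concerns separate: accounting always happens on a drop, while error reporting stays opt-in through IP_RECVERR.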
  4. 28 Aug, 2009 1 commit
  5. 12 Jul, 2009 1 commit
  6. 11 Jun, 2009 1 commit
    • net: No more expensive sock_hold()/sock_put() on each tx · 2b85a34e
      Eric Dumazet committed
      One of the problems with sock memory accounting is that it uses
      a pair of sock_hold()/sock_put() for each transmitted packet.
      
      This slows down bidirectional flows because the receive path
      also needs to take a refcount on socket and might use a different
      cpu than transmit path or transmit completion path. So these
      two atomic operations also trigger cache line bounces.
      
      We can see this in tx or tx/rx workloads (media gateways for example),
      where sock_wfree() can be in top five functions in profiles.
      
      We use this sock_hold()/sock_put() so that sock freeing
      is delayed until all tx packets are completed.
      
      As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
      by one unit at init time, until sk_free() is called.
      Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
      to decrement the initial offset and atomically check if any packets
      are in flight.
      
      skb_set_owner_w() doesn't call sock_hold() anymore.
      
      sock_wfree() doesn't call sock_put() anymore, but checks if sk_wmem_alloc
      reached 0 to perform the final freeing.
      
      Drawback is that a skb->truesize error could lead to unfreeable sockets, or
      even worse, prematurely calling __sk_free() on a live socket.
      
      Nice speedups on SMP: tbench, for example, goes from 2691 MB/s to 2711 MB/s
      on my 8 cpu dev machine, even though tbench was not really hitting the
      sk_refcnt contention point. There is a 5% speedup on a UDP transmit workload
      (depending on the number of flows), with lower TX completion cpu usage.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
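The offset trick described in this commit can be sketched in user space with C11 atomics. This is a minimal model under stated assumptions, not the kernel implementation: `mini_sock` and the `mini_` functions are hypothetical names that only mirror the roles of sk_wmem_alloc, skb_set_owner_w(), sock_wfree() and sk_free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mini_sock {
    atomic_int wmem_alloc; /* bytes in flight, plus 1 while the owner holds the socket */
    bool destroyed;        /* set when the final free runs */
};

static void mini_sock_init(struct mini_sock *sk)
{
    atomic_init(&sk->wmem_alloc, 1); /* the initial offset of one unit */
    sk->destroyed = false;
}

static void mini_final_free(struct mini_sock *sk)
{
    assert(!sk->destroyed); /* a second free would mean broken accounting */
    sk->destroyed = true;
}

/* Charge a packet to the socket: no separate reference is taken. */
static void mini_set_owner_w(struct mini_sock *sk, int truesize)
{
    atomic_fetch_add(&sk->wmem_alloc, truesize);
}

/* TX completion: uncharge, and free the socket if the count reached 0. */
static void mini_wfree(struct mini_sock *sk, int truesize)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize)
        mini_final_free(sk);
}

/* Owner drops the socket: remove the offset, free only if nothing is in flight. */
static void mini_sk_free(struct mini_sock *sk)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, 1) == 1)
        mini_final_free(sk);
}
```

Whichever of the two paths brings the counter to zero performs the final free, so one atomic operation per packet replaces the hold/put pair. The sketch also shows the drawback named in the message: if a packet is uncharged with a different size than it was charged with, the counter never reaches zero, or reaches it too early.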
  7. 09 Jun, 2009 1 commit
  8. 03 Jun, 2009 2 commits
  9. 27 Apr, 2009 1 commit
  10. 16 Feb, 2009 1 commit
  11. 25 Nov, 2008 2 commits
    • net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() · a21bba94
      Eric Dumazet committed
      We can reduce pressure on the dst entry refcount that slows down the UDP
      transmit path on SMP machines. This pressure is visible on RTP servers when
      delivering content to media gateways, especially big ones handling
      thousands of streams. Several cpus send UDP frames to the same
      destination, hence use the same dst entry.
      
      This patch makes ip_push_pending_frames() steal the refcount its
      callers had to take when filling inet->cork.dst.
      
      This doesn't avoid all refcounting, but still gives speedups on SMP,
      on the UDP/RAW transmit path.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: avoid a pair of dst_hold()/dst_release() in ip_append_data() · 2e77d89b
      Eric Dumazet committed
      We can reduce pressure on the dst entry refcount that slows down the UDP
      transmit path on SMP machines. This pressure is visible on RTP servers when
      delivering content to media gateways, especially big ones handling
      thousands of streams. Several cpus send UDP frames to the same
      destination, hence use the same dst entry.
      
      This patch makes ip_append_data() eventually steal the refcount its
      callers had to take on the dst entry.
      
      This doesn't avoid all refcounting, but still gives speedups on SMP,
      on the UDP/RAW transmit path.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
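Both commits of 25 Nov, 2008 rely on the same idea: instead of taking a new reference and releasing the old one, the existing reference is handed over. The following is a minimal user-space sketch of that hand-off, assuming hypothetical `mini_` types; the `mini_refcnt_ops` counter exists only to make the saved atomic operations visible and has no kernel counterpart.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mini_dst    { atomic_int refcnt; };
struct mini_cork   { struct mini_dst *dst; };
struct mini_packet { struct mini_dst *dst; };

/* Illustration only: counts how many atomic refcount updates were issued. */
static int mini_refcnt_ops;

static void mini_dst_hold(struct mini_dst *d)
{
    assert(d != NULL);
    atomic_fetch_add(&d->refcnt, 1);
    mini_refcnt_ops++;
}

static void mini_dst_release(struct mini_dst *d)
{
    if (d == NULL)
        return; /* the reference was stolen, nothing to drop */
    atomic_fetch_sub(&d->refcnt, 1);
    mini_refcnt_ops++;
}

/* Append step: take over the caller's reference instead of hold + release. */
static void mini_append_steal(struct mini_cork *cork, struct mini_dst **dstp)
{
    cork->dst = *dstp;
    *dstp = NULL; /* the caller no longer owns a reference */
}

/* Push step: move the cork's reference to the outgoing packet. */
static void mini_push_steal(struct mini_cork *cork, struct mini_packet *pkt)
{
    pkt->dst = cork->dst;
    cork->dst = NULL;
}
```

In this model the caller still calls `mini_dst_release()` on its own pointer afterwards, which is a no-op once the reference has been stolen. That keeps the error paths of the caller unchanged while the shared counter is touched only by the initial lookup and the final release.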
  12. 03 Nov, 2008 1 commit
  13. 01 Oct, 2008 1 commit
  14. 26 Jul, 2008 1 commit
  15. 17 Jul, 2008 1 commit
  16. 15 Jul, 2008 1 commit
  17. 12 Jun, 2008 1 commit
  18. 30 Apr, 2008 1 commit
    • [IPv4] UFO: prevent generation of chained skb destined to UFO device · be9164e7
      Kostya B committed
      Problem: ip_append_data() could wrongly generate a chained skb for
      devices which support UFO.  When sk_write_queue is not empty
      (e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
      of the queued skb, a new chained skb is created.
      
      I would normally assume a UFO device should get data in nr_frags and not
      in frag_list.  Later, udp4_hwcsum_outgoing() resets csum to NONE
      and skb_gso_segment() oopses.
      
      Proposal:
      1. Even if the length is less than the mtu, employ ip_ufo_append_data()
      and append data to the __existing__ skb in the sk_write_queue.
      
      2. ip_ufo_append_data() is fixed, because it wrongly peeked and later
      enqueued the same skb.  Now, enqueuing is
      always performed, because on error the subsequent
      ip_flush_pending_frames() would release the queued skb.
      Signed-off-by: Kostya B <bkostya@hotmail.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
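Point 1 of the proposal amounts to widening the condition under which the UFO append path is chosen. The predicate below is a hedged sketch of that change, not the kernel source: `mini_append_ctx` and both functions are hypothetical, and the "before" condition is inferred from the commit message rather than quoted from the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs describing one append request. */
struct mini_append_ctx {
    size_t length;      /* bytes to append */
    size_t mtu;         /* MTU used for the comparison */
    bool   queue_empty; /* is the socket write queue empty? */
    bool   is_udp;
    bool   dev_has_ufo; /* does the output device support UFO? */
};

/* Assumed earlier behaviour: UFO path only when the data exceeds the MTU. */
static bool mini_use_ufo_before(const struct mini_append_ctx *c)
{
    assert(c != NULL);
    return c->length > c->mtu && c->is_udp && c->dev_has_ufo;
}

/* Proposed behaviour: also take the UFO path when data is already queued. */
static bool mini_use_ufo_after(const struct mini_append_ctx *c)
{
    assert(c != NULL);
    return (c->length > c->mtu || !c->queue_empty) &&
           c->is_udp && c->dev_has_ufo;
}
```

The case that differs is a short write following a corked one (for example with MSG_MORE): the queue is not empty and the length is below the MTU, so only the second predicate selects the UFO path and the data lands in the already queued skb.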
  19. 26 Mar, 2008 1 commit
  20. 25 Mar, 2008 2 commits
  21. 06 Mar, 2008 1 commit
  22. 01 Feb, 2008 2 commits
  23. 29 Jan, 2008 6 commits
  24. 23 Jan, 2008 2 commits
  25. 07 Nov, 2007 1 commit
  26. 24 Oct, 2007 1 commit
  27. 16 Oct, 2007 1 commit
    • [IPV4]: Uninline netfilter okfns · 861d0486
      Patrick McHardy committed
      Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc
      can generate tail calls for some of the netfilter hook okfn invocations,
      so there is no need to inline the functions anymore. Inlining caused huge
      code bloat: because we pass the function's address to nf_hook_slow, we
      ended up with one inlined version and one out-of-line version.
      
      Before:
         text    data     bss      dec     hex filename
      8997385 1016524  524652 10538561  a0ce41 vmlinux
      
      After:
         text    data     bss      dec     hex filename
      8994009 1016524  524652 10535185  a0c111 vmlinux
      -------------------------------------------------------
        -3376
      
      All cases have been verified to generate tail-calls with and without
      netfilter. The okfns in ipmr and xfrm4_input still remain inline because
      gcc can't generate tail-calls for them.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
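The shape of code this commit is about can be sketched as follows. All names are hypothetical stand-ins (`mini_hook_slow()` plays the role of nf_hook_slow, `mini_finish()` the role of an okfn), and whether a compiler actually emits a tail call here depends on the compiler and its optimisation flags.

```c
#include <assert.h>
#include <stddef.h>

struct mini_pkt { int len; };
typedef int (*mini_okfn_t)(struct mini_pkt *);

/* An okfn: the continuation that runs once the hooks accept the packet.
 * It is a plain static function, not an inline one. */
static int mini_finish(struct mini_pkt *p)
{
    return p->len;
}

/* Stand-in for the slow path: it receives the okfn's address, so an
 * out-of-line copy of the okfn has to exist in any case. */
static int mini_hook_slow(struct mini_pkt *p, mini_okfn_t okfn, int accept)
{
    assert(p != NULL && okfn != NULL);
    return accept ? okfn(p) : -1;
}

/* Fast path when no hooks are registered: the okfn call is the last thing
 * the function does, so the compiler is free to turn it into a jump. */
static int mini_hook(struct mini_pkt *p, int hooks_registered)
{
    if (hooks_registered)
        return mini_hook_slow(p, mini_finish, 1);
    return mini_finish(p);
}
```

Because the address of `mini_finish()` is taken for the slow path, marking it inline would produce a second, inlined copy at the fast-path call site in addition to the out-of-line one, which is the duplication the commit removes.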
  28. 11 Oct, 2007 2 commits
    • [NET]: Move hardware header operations out of netdevice. · 3b04ddde
      Stephen Hemminger committed
      Since hardware header operations are part of the protocol class
      not the device instance, make them into a separate object and
      save memory.
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4]: Add ICMPMsgStats MIB (RFC 4293) · 96793b48
      David L Stevens committed
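The memory saving in the hardware header commit comes from sharing one operations table per protocol class instead of storing every function pointer in each device. A minimal sketch of that layout, assuming hypothetical `mini_` types and only two operations:

```c
#include <assert.h>
#include <stddef.h>

struct mini_dev;

/* Per-protocol-class operations, shared by every device of that class. */
struct mini_header_ops {
    int (*create)(struct mini_dev *dev, unsigned char *buf, size_t len);
    int (*parse)(const unsigned char *buf, size_t len);
};

/* Per-device instance: one pointer instead of a copy of each function pointer. */
struct mini_dev {
    const struct mini_header_ops *header_ops;
    unsigned char addr[6];
};

/* Writes the device address as a toy "header"; returns bytes written or -1. */
static int mini_eth_create(struct mini_dev *dev, unsigned char *buf, size_t len)
{
    if (len < sizeof(dev->addr))
        return -1;
    for (size_t i = 0; i < sizeof(dev->addr); i++)
        buf[i] = dev->addr[i];
    return (int)sizeof(dev->addr);
}

/* Returns the toy header length, or -1 if the buffer is too short. */
static int mini_eth_parse(const unsigned char *buf, size_t len)
{
    return (buf != NULL && len >= 6) ? 6 : -1;
}

/* One constant table for the whole class of devices. */
static const struct mini_header_ops mini_eth_header_ops = {
    .create = mini_eth_create,
    .parse  = mini_eth_parse,
};

static void mini_eth_setup(struct mini_dev *dev)
{
    assert(dev != NULL);
    dev->header_ops = &mini_eth_header_ops;
}
```

Every device set up through `mini_eth_setup()` points at the same table, so adding an operation grows the table once rather than growing every device structure.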
      Background: RFC 4293 deprecates existing individual, named ICMP
      type counters to be replaced with the ICMPMsgStatsTable. This table
      includes entries for both IPv4 and IPv6, and requires counting of all
      ICMP types, whether or not the machine implements the type.
      
      These patches "remove" (but not really) the existing counters, and
      replace them with the ICMPMsgStats tables for v4 and v6.
      The named counters stay where they were in /proc, but their values
      now come from the new tables. Packets generated from raw socket
      output (e.g., OutEchoes, MLD queries, RAs from radvd, etc.) are
      counted as well.
      
      Changes:
      1) create icmpmsg_statistics mib
      2) create icmpv6msg_statistics mib
      3) modify existing counters to use these
      4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
              listed by number for easy SNMP parsing
      5) modify /proc/net/snmp printing for "Icmp" to get the named data
              from new counters.
      Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
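The counting scheme listed in the changes above can be sketched as a table indexed by ICMP type number, with the named counters read out of it. This is an illustrative model only: the `mini_` names are hypothetical, and the sketch relies on just two facts, that the ICMP type field is 8 bits wide and that echo request and echo reply are types 8 and 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINI_ICMP_TYPES     256 /* the ICMP type field is 8 bits wide */
#define MINI_ICMP_ECHOREPLY 0
#define MINI_ICMP_ECHO      8

/* One input and one output counter per type, implemented or not. */
struct mini_icmpmsg_stats {
    unsigned long in[MINI_ICMP_TYPES];
    unsigned long out[MINI_ICMP_TYPES];
};

static void mini_count_in(struct mini_icmpmsg_stats *st, uint8_t type)
{
    assert(st != NULL);
    st->in[type]++;
}

static void mini_count_out(struct mini_icmpmsg_stats *st, uint8_t type)
{
    assert(st != NULL);
    st->out[type]++;
}

/* Named counters are derived from the table rather than stored separately. */
static unsigned long mini_in_echos(const struct mini_icmpmsg_stats *st)
{
    return st->in[MINI_ICMP_ECHO];
}

static unsigned long mini_out_echo_reps(const struct mini_icmpmsg_stats *st)
{
    return st->out[MINI_ICMP_ECHOREPLY];
}
```

Because every type has a slot, a message of a type the machine does not implement is still counted, and the named views cannot drift from the per-type table since they read the same storage.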
  29. 14 Aug, 2007 1 commit