1. 17 Nov, 2010 10 commits
    • net: reorder struct sock fields · b178bb3d
      Committed by Eric Dumazet
      Right now, fields in struct sock are not optimally ordered, because each
      path (RX softirq, TX completion, RX user, TX user) has to touch fields
      that are contained in many different cache lines.
      
      The really critical thing is to shrink the number of cache lines that
      are used at RX softirq time: the CPU handling softirqs for a device can
      receive many frames per second for many sockets. If the load is too big,
      we can drop frames at NIC level. RPS or multiqueue cards can help, but it
      is better to reduce latency if possible.
      
      This patch starts with the UDP protocol; additional patches will try to
      reduce the latencies of other protocols as well.
      
      At RX softirq time, fields of interest for UDP protocol are :
      (not counting ones in inet struct for the lookup)
      
      Read/Written:
      sk_refcnt   (atomic increment/decrement)
      sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
      sk_receive_queue
      sk_backlog (if socket locked by user program)
      sk_rxhash
      sk_forward_alloc
      sk_drops
      
      Read only:
      sk_rcvbuf (sk_rcvqueues_full())
      sk_filter
      sk_wq
      sk_policy[0]
      sk_flags
      
      Additional notes :
      
      - sk_backlog has one hole on 64bit arches. We can fill it to save 8
      bytes.
      - sk_backlog is used only if the RX softirq handler finds the socket
      while it is locked by the user.
      - sk_rxhash is written only once per flow.
      - sk_drops is written only if the queues are full.
      
      Final layout :
      
      [1] One section grouping all read/write fields, but placing rxhash and
      sk_backlog at the end of this section.
      
      [2] One section grouping all read-only fields in RX handler
         (sk_filter, sk_rcvbuf, sk_wq)
      
      [3] Section used by other paths
      
      I'll post a patch on its own to put sk_refcnt at the end of struct
      sock_common so that it shares the same cache line as section [1]
      
      New offsets on 64bit arch :
      
      sizeof(struct sock)=0x268
      offsetof(struct sock, sk_refcnt)  =0x10
      offsetof(struct sock, sk_lock)    =0x48
      offsetof(struct sock, sk_receive_queue)=0x68
      offsetof(struct sock, sk_backlog)=0x80
      offsetof(struct sock, sk_rmem_alloc)=0x80
      offsetof(struct sock, sk_forward_alloc)=0x98
      offsetof(struct sock, sk_rxhash)=0x9c
      offsetof(struct sock, sk_rcvbuf)=0xa4
      offsetof(struct sock, sk_drops) =0xa0
      offsetof(struct sock, sk_filter)=0xa8
      offsetof(struct sock, sk_wq)=0xb0
      offsetof(struct sock, sk_policy)=0xd0
      offsetof(struct sock, sk_flags) =0xe0
      
      Instead of :
      
      sizeof(struct sock)=0x270
      offsetof(struct sock, sk_refcnt)  =0x10
      offsetof(struct sock, sk_lock)    =0x50
      offsetof(struct sock, sk_receive_queue)=0xc0
      offsetof(struct sock, sk_backlog)=0x70
      offsetof(struct sock, sk_rmem_alloc)=0xac
      offsetof(struct sock, sk_forward_alloc)=0x10c
      offsetof(struct sock, sk_rxhash)=0x128
      offsetof(struct sock, sk_rcvbuf)=0x4c
      offsetof(struct sock, sk_drops) =0x16c
      offsetof(struct sock, sk_filter)=0x198
      offsetof(struct sock, sk_wq)=0x88
      offsetof(struct sock, sk_policy)=0x98
      offsetof(struct sock, sk_flags) =0x130
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
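The effect of the reordering can be checked with a little arithmetic on the offsets quoted in this commit message. The sketch below assumes 64-byte cache lines (the message does not state a line size) and looks only at the starting offset of each field, so a field that straddles a boundary is counted once. `count_cache_lines` and the two arrays are illustrative names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; not stated in the commit message. */
#define CACHE_LINE_SIZE 64u

/* Starting offsets of the UDP RX softirq fields, copied from the commit
 * message, in this order: sk_refcnt, sk_receive_queue, sk_backlog,
 * sk_rmem_alloc, sk_forward_alloc, sk_rxhash, sk_rcvbuf, sk_drops,
 * sk_filter, sk_wq, sk_policy, sk_flags. */
static const unsigned new_offsets[] = {
    0x10, 0x68, 0x80, 0x80, 0x98, 0x9c, 0xa4, 0xa0, 0xa8, 0xb0, 0xd0, 0xe0
};
static const unsigned old_offsets[] = {
    0x10, 0xc0, 0x70, 0xac, 0x10c, 0x128, 0x4c, 0x16c, 0x198, 0x88, 0x98, 0x130
};

/* Count the distinct cache lines in which the given fields start. */
static unsigned count_cache_lines(const unsigned *offsets, size_t n)
{
    uint64_t seen = 0; /* bit i set => cache line i is touched */
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned line = offsets[i] / CACHE_LINE_SIZE;

        assert(line < 64); /* the bitmask only covers 64 lines */
        if (!(seen & (UINT64_C(1) << line))) {
            seen |= UINT64_C(1) << line;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, the twelve fields start in 4 distinct cache lines with the new layout and in 7 with the old one.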
    • udp: use atomic_inc_not_zero_hint · c31504dc
      Committed by Eric Dumazet
      The refcount of a UDP socket is usually 2, unless an incoming frame is
      going to be queued in the receive or backlog queue.
      
      Using atomic_inc_not_zero_hint() reduces latency, because the processor
      issues fewer memory transactions.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
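The idea behind a hinted increment can be sketched in userspace with C11 atomics: instead of first loading the counter and then attempting a compare-and-swap, the first attempt uses the caller's guess (here, the usual value 2), which saves a memory transaction when the guess is right. This is an analogue under that assumption, not the kernel implementation; `inc_not_zero_hint` is a name chosen for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Increment *v unless it is zero. Returns true if the increment happened.
 * hint is the value the caller expects *v to hold most of the time. */
static bool inc_not_zero_hint(atomic_int *v, int hint)
{
    assert(v != NULL);

    /* Start from the guess instead of reading *v first. A zero hint is
     * useless as a guess, so fall back to an ordinary load. */
    int expected = hint ? hint : atomic_load(v);

    while (expected != 0) {
        /* On failure, expected is refreshed with the current value. */
        if (atomic_compare_exchange_strong(v, &expected, expected + 1))
            return true;
    }
    return false; /* counter already dropped to zero */
}
```

When the hint is wrong, the failed compare-and-swap returns the current value, so the loop costs no more than the unhinted version would.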
    • vlan: remove ndo_select_queue() logic · 213b15ca
      Committed by Eric Dumazet
      Now that vlan devices are lockless, we don't need special
      ndo_select_queue() logic. dev_pick_tx() will do the multiqueue stuff on
      the real device transmit.
      Suggested-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: lockless transmit path · 4af429d2
      Committed by Eric Dumazet
      vlan is a stacked device, like tunnels. We should use the lockless
      mechanism we are using in tunnels and loopback.
      
      This patch completely removes locking in TX path.
      
      tx stat counters are added into existing percpu stat structure, renamed
      from vlan_rx_stats to vlan_pcpu_stats.
      
      Note : this partially reverts commit 2e59af3d (vlan: multiqueue vlan
      device)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macvlan: lockless tx path · 8ffab51b
      Committed by Eric Dumazet
      macvlan is a stacked device, like tunnels. We should use the lockless
      mechanism we are using in tunnels and loopback.
      
      This patch completely removes locking in TX path.
      
      tx stat counters are added into existing percpu stat structure, renamed
      from rx_stats to pcpu_stats.
      
      Note : this reverts commit 2c114553 (macvlan: add multiqueue
      capability)
      
      Note : rx_errors is converted to a 32bit counter, like tx_dropped, since
      they don't need 64bit range.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
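The per-CPU statistics idea used by the vlan and macvlan commits can be sketched as follows: each CPU updates only its own slot, so the transmit path takes no lock, and a reader folds all slots together. The struct, the CPU count and the function names are hypothetical. A real kernel implementation also has to keep 64-bit reads consistent on 32-bit machines, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for the sketch */

/* Hypothetical per-CPU counters, loosely modelled on the description:
 * 64-bit packet/byte counters, 32-bit error/drop counters. */
struct pcpu_stats_sketch {
    uint64_t rx_packets, rx_bytes;
    uint64_t tx_packets, tx_bytes;
    uint32_t rx_errors, tx_dropped;
};

static struct pcpu_stats_sketch stats[NR_CPUS_SKETCH];

/* TX path: each CPU touches only its own slot, so no lock is taken. */
static void account_tx(unsigned cpu, uint64_t len, int ok)
{
    assert(cpu < NR_CPUS_SKETCH);
    if (ok) {
        stats[cpu].tx_packets++;
        stats[cpu].tx_bytes += len;
    } else {
        stats[cpu].tx_dropped++;
    }
}

/* Slow path (reading statistics): fold all per-CPU slots together. */
static struct pcpu_stats_sketch sum_stats(void)
{
    struct pcpu_stats_sketch total = {0};

    for (unsigned cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        total.rx_packets += stats[cpu].rx_packets;
        total.rx_bytes   += stats[cpu].rx_bytes;
        total.tx_packets += stats[cpu].tx_packets;
        total.tx_bytes   += stats[cpu].tx_bytes;
        total.rx_errors  += stats[cpu].rx_errors;
        total.tx_dropped += stats[cpu].tx_dropped;
    }
    return total;
}
```

The cost moves from the hot path (every transmitted frame) to the rare path (someone asking for the totals).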
    • packet: Enhance AF_PACKET implementation to not require high order contiguous... · 0e3125c7
      Committed by Neil Horman
      packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4)
      
      Version 4 of this patch.
      
      Change notes:
      1) Removed extra memset.  Didn't think kcalloc added a GFP_ZERO the way kzalloc did :)
      
      Summary:
      It was shown to me recently that systems under high load were driven very deep
      into swap when tcpdump was run.  The reason this happened was because the
      AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
      application to specify how many entries an AF_PACKET socket will have and how
      large each entry will be.  It seems the default setting for tcpdump is to set
      the ring buffer to 32 entries of 64 KB each, which implies 32 order 5
      allocations.  That's difficult under good circumstances, and horrid under memory
      pressure.
      
      I thought it would be good to make that a bit more usable.  I was going to do a
      simple conversion of the ring buffer from contiguous pages to iovecs, but
      unfortunately, the metadata which AF_PACKET places in these buffers can easily
      span a page boundary, and given that these buffers get mapped into user space,
      and the data layout doesn't easily allow for a change to padding between frames
      to avoid that, a simple iovec change is just going to break user space ABI
      consistency.
      
      So I've done this: I've added a three-tiered mechanism to the af_packet set_ring
      socket option.  It attempts to allocate memory in the following order:
      
      1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
      digging into swap
      
      2) Using vmalloc
      
      3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
      needed to get the memory
      
      The effect is that we don't disturb the system as much when we're under load,
      while still being able to conduct tcpdumps effectively.
      
      Tested successfully by me.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
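The three-tier fallback described in this commit can be sketched in userspace as a loop over allocators, tried from cheapest to most disruptive. The three allocator functions below are hypothetical stand-ins (the first one always fails, to simulate fragmented memory); they are not the kernel calls named in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the three allocation strategies. */
typedef void *(*alloc_fn)(size_t size);

/* Tier 1: fail fast. Here it always fails, to simulate fragmentation. */
static void *alloc_pages_noretry(size_t size) { (void)size; return NULL; }
/* Tier 2: virtually contiguous memory; malloc stands in for it. */
static void *alloc_virtual(size_t size) { return malloc(size); }
/* Tier 3: try as hard as needed; malloc stands in for it. */
static void *alloc_pages_retry(size_t size) { return malloc(size); }

/* Try each tier in order. On success, *tier holds 1, 2 or 3;
 * on total failure it holds 0 and NULL is returned. */
static void *alloc_ring_block(size_t size, int *tier)
{
    static const alloc_fn tiers[] = {
        alloc_pages_noretry, alloc_virtual, alloc_pages_retry
    };

    assert(tier != NULL);
    for (int i = 0; i < 3; i++) {
        void *p = tiers[i](size);

        if (p) {
            *tier = i + 1;
            return p;
        }
    }
    *tier = 0;
    return NULL;
}
```

Reporting the tier matters in a design like this because memory obtained from different allocators generally has to be released by the matching free routine.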
    • drivers/isdn/mISDN: Use printf extension %pV · 020f01eb
      Committed by Joe Perches
      Using %pV reduces the number of printk calls and
      eliminates any possible message interleaving from
      other printk calls.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
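The benefit described here, one output call instead of several, can be illustrated in userspace by formatting a prefix and a variadic message into a single buffer. This is only an analogue of the idea: it does not use the kernel's `%pV` extension, and `log_format` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Build "prefix: message" into buf so the caller can emit it with a
 * single output call. Returns the length the full message needs, as
 * snprintf does, or a negative value on a formatting error. */
static int log_format(char *buf, size_t size, const char *prefix,
                      const char *fmt, ...)
{
    va_list args;
    int n, m;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "%s: ", prefix);
    if (n < 0 || (size_t)n >= size)
        return n; /* error, or the prefix alone was truncated */

    /* Forward the caller's variadic arguments in one step. */
    va_start(args, fmt);
    m = vsnprintf(buf + n, size - (size_t)n, fmt, args);
    va_end(args);
    return m < 0 ? m : n + m;
}
```

For example, `log_format(buf, sizeof(buf), "mISDN", "card %d state %s", 3, "up")` leaves one complete line in `buf`, which a single write can then emit without another message landing in the middle of it.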
    • netlink: let nlmsg and nla functions take pointer-to-const args · 3654654f
      Committed by Jan Engelhardt
      The changed functions do not modify the NL messages and/or attributes
      at all. They should use const (similar to strchr), so that callers
      which have a const nlmsg/nlattr around can make use of them without
      casting.
      
      While at it, constify a data array.
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
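The point about const can be shown with a small sketch: an accessor that only reads its argument takes a pointer-to-const, so it accepts both constant and mutable data without a cast. The struct and function below are hypothetical, loosely modelled on an attribute header with a total length and a type; they are not the netlink definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute header: total length (header plus payload)
 * followed by a type. */
struct attr_sketch {
    uint16_t len;
    uint16_t type;
};

#define ATTR_HDRLEN ((int)sizeof(struct attr_sketch))

/* Read-only accessor: the pointer-to-const parameter documents that the
 * attribute is not modified and accepts const and non-const callers. */
static int attr_payload_len(const struct attr_sketch *a)
{
    assert(a != NULL);
    return a->len - ATTR_HDRLEN;
}

/* A constant object can be passed directly, with no cast. */
static const struct attr_sketch fixed_attr = { .len = 12, .type = 1 };
```

Had the parameter been declared as a plain `struct attr_sketch *`, passing `&fixed_attr` would discard the qualifier and draw a compiler diagnostic.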
    • ipv6: fix missing in6_ifa_put in addrconf · 9d82ca98
      Committed by John Fastabend
      Fix ref count bug introduced by
      
      commit 2de79570
      Author: Lorenzo Colitti <lorenzo@google.com>
      Date:   Wed Oct 27 18:16:49 2010 +0000
      
      ipv6: addrconf: don't remove address state on ifdown if the address
      is being kept
      
      Fix logic so that addrconf_ifdown() decrements the inet6_ifaddr
      refcnt correctly with in6_ifa_put().
      Reported-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
  2. 16 Nov, 2010 30 commits