1. 15 Nov 2011, 2 commits
    • bnx2x: uses build_skb() in receive path · e52fcb24
      Committed by Eric Dumazet
      bnx2x uses the following formula to compute its rx_buf_sz:
      
      dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
      
      Then the core network stack adds NET_SKB_PAD and
      SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).
      
      Final allocated size for the skb head on x86_64 (L1_CACHE_BYTES = 64,
      MTU = 1500): 2112 bytes, which SLUB/SLAB rounds up to 4096 bytes (see
      the worked arithmetic below).
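
      To make the 2112 concrete, here is the worked arithmetic; the aligned
      skb_shared_info size of 384 bytes is an assumption for kernels of that
      era (the exact value is config-dependent):

          mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
              = 1500 + 128 + 32                          = 1660
          + NET_SKB_PAD (64)                             = 1724
          SKB_DATA_ALIGN(1724)                           = 1728
          + SKB_DATA_ALIGN(sizeof(skb_shared_info))      = 1728 + 384 = 2112

      The smallest slab that fits 2112 bytes is kmalloc-4096.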
      
      Since skb truesize is then bigger than SK_MEM_QUANTUM, we get a lot of
      false sharing because of mem_reclaim in the UDP stack.
      
      One possible way to halve truesize is to reduce the required size by
      64 bytes (2112 -> 2048 bytes).
      
      Instead of allocating a full cache line at the end of the packet for
      alignment, we can use the fact that skb_shared_info sits at the end of
      skb->head and reuse that room, if we convert bnx2x to the new
      build_skb() infrastructure.
      
      skb_shared_info will be initialized only after the hardware has
      finished its transfer, so we can safely overwrite the final padding;
      the layout sketch below illustrates this.
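
      A rough layout sketch (sizes assume x86_64 and MTU 1500; the exact
      bnx2x constants are simplified here):

      /*
       * old scheme: tail alignment and shared info are distinct:
       *   | NET_SKB_PAD | frame | 2*L1_CACHE_BYTES pad | skb_shared_info |
       *   -> 2112 bytes -> kmalloc-4096 slab
       *
       * build_skb() scheme: the skb_shared_info room doubles as the tail
       * alignment, because it is written only after DMA has completed:
       *   | NET_SKB_PAD | frame .................... | skb_shared_info |
       *   -> 2048 bytes -> kmalloc-2048 slab, truesize halved
       */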
      
      Using build_skb() also reduces cache line misses in the driver, since
      we use cache-hot skbs instead of cold ones. The number of in-flight
      sk_buff structures is lower, and they are recycled while still hot.
      
      Performance results:

      820,000 pps on a single-threaded RX UDP benchmark, instead of
      720,000 pps.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Eilon Greenstein <eilong@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce build_skb() · b2b5ce9d
      Committed by Eric Dumazet
      One of the things we discussed during the netdev 2011 conference was
      the idea of changing some network drivers to allocate/populate their
      skb at RX completion time, right before feeding the skb to the network
      stack.
      
      In the old days, we allocated skbs while populating the RX ring.
      
      This meant bringing the sk_buff and skb_shared_info cache lines into
      the cpu cache (since we clear/initialize them), then 'queuing'
      skb->data to the NIC.
      
      By the time the NIC has filled a frame into the skb->data buffer and
      the host can process it, the cpu has probably evicted those cache
      lines, because a lot of things happened between the allocation and the
      final use.
      
      So the deal is to allocate only the data buffer for the NIC when
      populating the RX ring, and use build_skb() at RX completion to attach
      the data buffer (now filled with an Ethernet frame) to a fresh skb,
      initialize the skb_shared_info portion, and hand the cache-hot skb to
      the network stack.
      
      build_skb() is the function that allocates an skb, with the caller
      providing the data buffer to be attached to it. Drivers are expected
      to call skb_reserve() right after build_skb() to adjust skb->data to
      the Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN,
      though some drivers might add a hardware-provided alignment).
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to append SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting the
      incoming frame. For example:
      
      /* room for head padding, a 1536-byte frame and the shared info */
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
                     GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
              recycle_data(data);     /* driver-private buffer reuse */
      } else {
              skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
              ...
      }
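
      For reference, the core of the new helper looks roughly like this (a
      simplified sketch: kmemcheck annotations, the offset-based skb->end
      variant and some field setup are elided):

      struct sk_buff *build_skb(void *data)
      {
              struct skb_shared_info *shinfo;
              struct sk_buff *skb;
              unsigned int size;

              skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
              if (!skb)
                      return NULL;

              /* usable room: everything up to the trailing shared info */
              size = ksize(data) -
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

              memset(skb, 0, offsetof(struct sk_buff, tail));
              skb->truesize = SKB_TRUESIZE(size);
              atomic_set(&skb->users, 1);
              skb->head = data;
              skb->data = data;
              skb_reset_tail_pointer(skb);
              skb->end = skb->tail + size;

              /* only now, after the NIC has filled the frame, is the
               * shared info area written */
              shinfo = skb_shinfo(skb);
              memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
              atomic_set(&shinfo->dataref, 1);

              return skb;
      }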
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Nov 2011, 27 commits
  3. 13 Nov 2011, 4 commits
  4. 10 Nov 2011, 2 commits
    • ipv4: PKTINFO doesn't need dst reference · d826eb14
      Committed by Eric Dumazet
      On Monday, 07 November 2011 at 15:33 +0100, Eric Dumazet wrote:
      
      > At least, in recent kernels we don't change dst->refcnt in the
      > forwarding path (using NOREF skb->dst)
      >
      > One particular point is the atomic_inc(dst->refcnt) we have to
      > perform when queuing a UDP packet if the socket asked for PKTINFO
      > (for example, a typical DNS server has to set up this option)
      >
      > I have a patch somewhere that stores the information in skb->cb[]
      > and avoids the atomic_{inc|dec}(dst->refcnt).
      >
      >
      
      OK, I found it. I did some extra tests and believe it's ready.
      
      [PATCH net-next] ipv4: IP_PKTINFO doesn't need dst reference
      
      When a socket uses IP_PKTINFO notifications, we currently force a dst
      reference for each received skb. The reader has to access the dst to
      get the needed information (rt_iif & rt_spec_dst) and must release the
      dst reference.
      
      We also force a dst reference if the skb is put in the socket backlog,
      even without IP_PKTINFO handling. This happens under stress/load.
      
      We can instead store the needed information in skb->cb[], so that only
      the softirq handler really accesses the dst, improving cache hit
      ratios.
      
      This removes two atomic operations per packet, and the false sharing
      as well; the sketch below shows the mechanism.
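
      Roughly (a simplified sketch; helper names follow the patch):

      /* reuse skb->cb[] to carry what IP_PKTINFO needs */
      #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb))

      /* runs in softirq context, while the dst is still cache hot */
      void ipv4_pktinfo_prepare(struct sk_buff *skb)
      {
              struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
              struct rtable *rt = skb_rtable(skb);

              if (rt) {
                      pktinfo->ipi_ifindex = rt->rt_iif;
                      pktinfo->ipi_spec_dst.s_addr = rt->rt_spec_dst;
              } else {
                      pktinfo->ipi_ifindex = 0;
                      pktinfo->ipi_spec_dst.s_addr = 0;
              }
              /* no dst reference has to be held across the backlog */
              skb_dst_drop(skb);
      }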
      
      On a benchmark using a single-threaded receiver (doing only recvmsg()
      calls), I can reach 720,000 pps instead of 570,000 pps.
      
      IP_PKTINFO is typically used by DNS servers and any multihoming-aware
      UDP application.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3
      Committed by Eric Dumazet
      Reading /proc/net/snmp on a machine with a lot of cpus is very
      expensive (it can take ~88000 us).
      
      This is because the ICMPMSG MIB uses 4096 bytes per cpu, and folding
      the values for all possible cpus can read 16 MB of memory.
      
      ICMP messages are not considered a fast path on a typical server, and
      in practice only a few cpus handle them anyway. We can afford an
      atomic operation instead of percpu data.
      
      This saves 4096 bytes per cpu and per network namespace; the shape of
      the change is sketched below.
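
      The change boils down to replacing the percpu counter array with a
      single shared array of atomic counters, roughly (the struct and macro
      names here are illustrative, not the patch's exact identifiers):

      /* before: one 4096-byte array per cpu (512 counters * 8 bytes) */
      struct icmpmsg_mib_percpu {
              unsigned long   mibs[ICMPMSG_MIB_MAX];
      };

      /* after: one shared array per network namespace */
      struct icmpmsg_mib {
              atomic_long_t   mibs[ICMPMSG_MIB_MAX];
      };

      /* bumping a counter becomes a single atomic operation */
      #define ICMPMSG_INC_STATS(net, field) \
              atomic_long_inc(&(net)->mib.icmpmsg_statistics->mibs[field])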
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 09 Nov 2011, 5 commits