1. 30 11月, 2011 4 次提交
    • E
      flow_dissector: use a 64bit load/store · 4d77d2b5
      Eric Dumazet 提交于
      Le lundi 28 novembre 2011 à 19:06 -0500, David Miller a écrit :
      > From: Dimitris Michailidis <dm@chelsio.com>
      > Date: Mon, 28 Nov 2011 08:25:39 -0800
      >
      > >> +bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys
      > >> *flow)
      > >> +{
      > >> +	int poff, nhoff = skb_network_offset(skb);
      > >> +	u8 ip_proto;
      > >> +	u16 proto = skb->protocol;
      > >
      > > __be16 instead of u16 for proto?
      >
      > I'll take care of this when I apply these patches.
      
      ( CC trimmed )
      
      Thanks David !
      
      Here is a small patch to use one 64bit load/store on x86_64 instead of
      two 32bit load/stores.
      
      [PATCH net-next] flow_dissector: use a 64bit load/store
      
      gcc compiler is smart enough to use a single load/store if we
      memcpy(dptr, sptr, 8) on x86_64, regardless of
      CONFIG_CC_OPTIMIZE_FOR_SIZE
      
      In IP header, daddr immediately follows saddr, this wont change in the
      future. We only need to make sure our flow_keys (src,dst) fields wont
      break the rule.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d77d2b5
    • T
      bql: Byte queue limits · 114cf580
      Tom Herbert 提交于
      Networking stack support for byte queue limits, uses dynamic queue
      limits library.  Byte queue limits are maintained per transmit queue,
      and a dql structure has been added to netdev_queue structure for this
      purpose.
      
      Configuration of bql is in the tx-<n> sysfs directory for the queue
      under the byte_queue_limits directory.  Configuration includes:
      limit_min, bql minimum limit
      limit_max, bql maximum limit
      hold_time, bql slack hold time
      
      Also under the directory are:
      limit, current byte limit
      inflight, current number of bytes on the queue
      Signed-off-by: NTom Herbert <therbert@google.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      114cf580
    • T
      xps: Add xps_queue_release function · 927fbec1
      Tom Herbert 提交于
      This patch moves the xps specific parts in netdev_queue_release into
      its own function which netdev_queue_release can call.  This allows
      netdev_queue_release to be more generic (for adding new attributes
      to tx queues).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      927fbec1
    • T
      net: Add queue state xoff flag for stack · 73466498
      Tom Herbert 提交于
      Create separate queue state flags so that either the stack or drivers
      can turn on XOFF.  Added a set of functions used in the stack to determine
      if a queue is really stopped (either by stack or driver)
      Signed-off-by: NTom Herbert <therbert@google.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73466498
  2. 29 11月, 2011 4 次提交
    • E
      net: optimize socket timestamping · 08e29af3
      Eric Dumazet 提交于
      We can test/set multiple bits from sk_flags at once, to shorten a bit
      socket setup/dismantle phase.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08e29af3
    • E
      net: dont call jump_label_dec from irq context · b90e5794
      Eric Dumazet 提交于
      Igor Maravic reported an error caused by jump_label_dec() being called
      from IRQ context :
      
       BUG: sleeping function called from invalid context at kernel/mutex.c:271
       in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
       1 lock held by swapper/0:
        #0:  (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
       Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
      Call Trace:
       <IRQ>  [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
       [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
       [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
       [<ffffffff8109a37f>] ? local_clock+0x6f/0x80
       [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
       [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
       [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
       [<ffffffff81566525>] net_disable_timestamp+0x15/0x20
       [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
       [<ffffffff81557b00>] __sk_free+0x80/0x200
       [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff81557cba>] sock_wfree+0x3a/0x70
       [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
       [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
       [<ffffffff8155c119>] kfree_skb+0x49/0x170
       [<ffffffff815e936e>] arp_error_report+0x3e/0x90
       [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
       [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
       [<ffffffff81578d20>] ? neigh_update+0x640/0x640
       [<ffffffff81073558>] __do_softirq+0xc8/0x3a0
      
      Since jump_label_{inc|dec} must be called from process context only,
      we must defer jump_label_dec() if net_disable_timestamp() is called
      from interrupt context.
      Reported-by: NIgor Maravic <igorm@etf.rs>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b90e5794
    • E
      net: use skb_flow_dissect() in __skb_get_rxhash() · 4504b861
      Eric Dumazet 提交于
      No functional changes.
      
      This uses the code we factorized in skb_flow_dissect()
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4504b861
    • E
      net: introduce skb_flow_dissect() · 0744dd00
      Eric Dumazet 提交于
      We use at least two flow dissectors in network stack, with known
      limitations and code duplication.
      
      Introduce skb_flow_dissect() to factorize this, highly inspired from
      existing dissector from __skb_get_rxhash()
      
      Note : We extensively use skb_header_pointer(), this permits us to not
      touch skb at all.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0744dd00
  3. 26 11月, 2011 1 次提交
  4. 24 11月, 2011 1 次提交
  5. 23 11月, 2011 3 次提交
  6. 19 11月, 2011 1 次提交
    • H
      net: Remove all uses of LL_ALLOCATED_SPACE · ae641949
      Herbert Xu 提交于
      net: Remove all uses of LL_ALLOCATED_SPACE
      
      The macro LL_ALLOCATED_SPACE was ill-conceived.  It applies the
      alignment to the sum of needed_headroom and needed_tailroom.  As
      the amount that is then reserved for head room is needed_headroom
      with alignment, this means that the tail room left may be too small.
      
      This patch replaces all uses of LL_ALLOCATED_SPACE with the macro
      LL_RESERVED_SPACE and direct reference to needed_tailroom.
      
      This also fixes the problem with needed_headroom changing between
      allocating the skb and reserving the head room.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae641949
  7. 18 11月, 2011 1 次提交
    • E
      net: use jump_label to shortcut RPS if not setup · adc9300e
      Eric Dumazet 提交于
      Most machines dont use RPS/RFS, and pay a fair amount of instructions in
      netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
      RPS/RFS is not setup.
      
      Add a jump_label named rps_needed.
      
      If no device rps_map or global rps_sock_flow_table is setup,
      netif_receive_skb() / netif_rx() do a single instruction instead of many
      ones, including conditional jumps.
      
      jmp +0    (if CONFIG_JUMP_LABEL=y)
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      adc9300e
  8. 17 11月, 2011 10 次提交
  9. 15 11月, 2011 1 次提交
    • E
      net: introduce build_skb() · b2b5ce9d
      Eric Dumazet 提交于
      One of the thing we discussed during netdev 2011 conference was the idea
      to change some network drivers to allocate/populate their skb at RX
      completion time, right before feeding the skb to network stack.
      
      In old days, we allocated skbs when populating the RX ring.
      
      This means bringing into cpu cache sk_buff and skb_shared_info cache
      lines (since we clear/initialize them), then 'queue' skb->data to NIC.
      
      By the time NIC fills a frame in skb->data buffer and host can process
      it, cpu probably threw away the cache lines from its caches, because lot
      of things happened between the allocation and final use.
      
      So the deal would be to allocate only the data buffer for the NIC to
      populate its RX ring buffer. And use build_skb() at RX completion to
      attach a data buffer (now filled with an ethernet frame) to a new skb,
      initialize the skb_shared_info portion, and give the hot skb to network
      stack.
      
      build_skb() is the function to allocate an skb, caller providing the
      data buffer that should be attached to it. Drivers are expected to call
      skb_reserve() right after build_skb() to adjust skb->data to the
      Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
      drivers might add a hardware provided alignment)
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting
      incoming frame.
      
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
      	       GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
      	recycle_data(data);
      } else {
      	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
      	...
      }
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2b5ce9d
  10. 14 11月, 2011 1 次提交
    • E
      neigh: new unresolved queue limits · 8b5c171b
      Eric Dumazet 提交于
      Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
      > From: David Miller <davem@davemloft.net>
      > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
      >
      > > From: Eric Dumazet <eric.dumazet@gmail.com>
      > > Date: Wed, 09 Nov 2011 12:14:09 +0100
      > >
      > >> unres_qlen is the number of frames we are able to queue per unresolved
      > >> neighbour. Its default value (3) was never changed and is responsible
      > >> for strange drops, especially if IP fragments are used, or multiple
      > >> sessions start in parallel. Even a single tcp flow can hit this limit.
      > >  ...
      > >
      > > Ok, I've applied this, let's see what happens :-)
      >
      > Early answer, build fails.
      >
      > Please test build this patch with DECNET enabled and resubmit.  The
      > decnet neigh layer still refers to the removed ->queue_len member.
      >
      > Thanks.
      
      Ouch, this was fixed on one machine yesterday, but not the other one I
      used this morning, sorry.
      
      [PATCH V5 net-next] neigh: new unresolved queue limits
      
      unres_qlen is the number of frames we are able to queue per unresolved
      neighbour. Its default value (3) was never changed and is responsible
      for strange drops, especially if IP fragments are used, or multiple
      sessions start in parallel. Even a single tcp flow can hit this limit.
      
      $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
      PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
      8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b5c171b
  11. 10 11月, 2011 1 次提交
    • J
      net: add wireless TX status socket option · 6e3e939f
      Johannes Berg 提交于
      The 802.1X EAPOL handshake hostapd does requires
      knowing whether the frame was ack'ed by the peer.
      Currently, we fudge this pretty badly by not even
      transmitting the frame as a normal data frame but
      injecting it with radiotap and getting the status
      out of radiotap monitor as well. This is rather
      complex, confuses users (mon.wlan0 presence) and
      doesn't work with all hardware.
      
      To get rid of that hack, introduce a real wifi TX
      status option for data frame transmissions.
      
      This works similar to the existing TX timestamping
      in that it reflects the SKB back to the socket's
      error queue with a SCM_WIFI_STATUS cmsg that has
      an int indicating ACK status (0/1).
      
      Since it is possible that at some point we will
      want to have TX timestamping and wifi status in a
      single errqueue SKB (there's little point in not
      doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
      to SO_EE_ORIGIN_TXSTATUS which can collect more
      than just the timestamp; keep the old constant
      as an alias of course. Currently the internal APIs
      don't make that possible, but it wouldn't be hard
      to split them up in a way that makes it possible.
      
      Thanks to Neil Horman for helping me figure out
      the functions that add the control messages.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      6e3e939f
  12. 09 11月, 2011 1 次提交
  13. 04 11月, 2011 1 次提交
    • T
      net: Add back alignment for size for __alloc_skb · bc417e30
      Tony Lindgren 提交于
      Commit 87fb4b7b (net: more
      accurate skb truesize) changed the alignment of size. This
      can cause problems at least on some machines with NFS root:
      
      Unhandled fault: alignment exception (0x801) at 0xc183a43a
      Internal error: : 801 [#1] PREEMPT
      Modules linked in:
      CPU: 0    Not tainted  (3.1.0-08784-g5eeee4a #733)
      pc : [<c02fbba0>]    lr : [<c02fbb9c>]    psr: 60000013
      sp : c180fef8  ip : 00000000  fp : c181f580
      r10: 00000000  r9 : c044b28c  r8 : 00000001
      r7 : c183a3a0  r6 : c1835be0  r5 : c183a412  r4 : 000001f2
      r3 : 00000000  r2 : 00000000  r1 : ffffffe6  r0 : c183a43a
      Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
      Control: 0005317f  Table: 10004000  DAC: 00000017
      Process swapper (pid: 1, stack limit = 0xc180e270)
      Stack: (0xc180fef8 to 0xc1810000)
      fee0:                                                       00000024 00000000
      ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
      ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
      ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
      ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
      ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
      ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
      ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
      ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
      Function entered at [<c02fbba0>] from [<c019dfc4>]
      Function entered at [<c019dfc4>] from [<c01f34f4>]
      Function entered at [<c01f34f4>] from [<c01f4918>]
      Function entered at [<c01f4918>] from [<c0008758>]
      Function entered at [<c0008758>] from [<c04b2224>]
      Function entered at [<c04b2224>] from [<c001056c>]
      Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)
      
      Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
      skb->end can be unaligned without this patch.
      
      As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens
      only with SLOB, and not with SLAB or SLUB:
      
      * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]:
      >
      > Your patch is absolutely needed, I completely forgot about SLOB :(
      >
      > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
      > power of two.
      >
      > [   60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
      > [   60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
      > [   60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
      > [   60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
      > [   60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
      > [   60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
      > [   60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
      > [   60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc417e30
  14. 02 11月, 2011 1 次提交
  15. 01 11月, 2011 2 次提交
  16. 30 10月, 2011 1 次提交
    • E
      vlan: allow nested vlan_do_receive() · 6a32e4f9
      Eric Dumazet 提交于
      commit 2425717b (net: allow vlan traffic to be received under bond)
      broke ARP processing on vlan on top of bonding.
      
             +-------+
      eth0 --| bond0 |---bond0.103
      eth1 --|       |
             +-------+
      
      52870.115435: skb_gro_reset_offset <-napi_gro_receive
      52870.115435: dev_gro_receive <-napi_gro_receive
      52870.115435: napi_skb_finish <-napi_gro_receive
      52870.115435: netif_receive_skb <-napi_skb_finish
      52870.115435: get_rps_cpu <-netif_receive_skb
      52870.115435: __netif_receive_skb <-netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: bond_handle_frame <-__netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: arp_rcv <-__netif_receive_skb
      52870.115436: kfree_skb <-arp_rcv
      
      Packet is dropped in arp_rcv() because its pkt_type was set to
      PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103
      exists.
      
      We really need to change pkt_type only if no more rx_handler is about to
      be called for the packet.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a32e4f9
  17. 26 10月, 2011 1 次提交
  18. 24 10月, 2011 2 次提交
  19. 21 10月, 2011 3 次提交