1. 29 Nov, 2011 3 commits
    • net: dont call jump_label_dec from irq context · b90e5794
      Committed by Eric Dumazet
      Igor Maravic reported an error caused by jump_label_dec() being called
      from IRQ context :
      
       BUG: sleeping function called from invalid context at kernel/mutex.c:271
       in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
       1 lock held by swapper/0:
        #0:  (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
       Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
      Call Trace:
       <IRQ>  [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
       [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
       [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
       [<ffffffff8109a37f>] ? local_clock+0x6f/0x80
       [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
       [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
       [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
       [<ffffffff81566525>] net_disable_timestamp+0x15/0x20
       [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
       [<ffffffff81557b00>] __sk_free+0x80/0x200
       [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff81557cba>] sock_wfree+0x3a/0x70
       [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
       [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
       [<ffffffff8155c119>] kfree_skb+0x49/0x170
       [<ffffffff815e936e>] arp_error_report+0x3e/0x90
       [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
       [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
       [<ffffffff81578d20>] ? neigh_update+0x640/0x640
       [<ffffffff81073558>] __do_softirq+0xc8/0x3a0
      
      Since jump_label_{inc|dec} must be called from process context only,
      we must defer jump_label_dec() if net_disable_timestamp() is called
      from interrupt context.
      Reported-by: Igor Maravic <igorm@etf.rs>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
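The deferral described in this commit message can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the kernel patch: `key_refs`, `deferred_decs`, the `in_irq_context` flag and all `model_*` names are hypothetical stand-ins for the jump label key and the kernel's own context check. Flushing pending work in both the enable and the process-context disable path is a choice made for this model only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the jump label key and its deferred work. */
static atomic_int key_refs;      /* models the jump label reference count */
static atomic_int deferred_decs; /* decrements requested from atomic context */

/* These two model operations that may sleep: process context only. */
static void slow_inc(void) { atomic_fetch_add(&key_refs, 1); }
static void slow_dec(void) { atomic_fetch_sub(&key_refs, 1); }

/* Apply every decrement that was postponed while in atomic context. */
static void flush_deferred(void)
{
	int pending = atomic_exchange(&deferred_decs, 0);

	while (pending-- > 0)
		slow_dec();
}

/* Assumed to be called from process context only. */
void model_enable_timestamp(void)
{
	flush_deferred();
	slow_inc();
}

void model_disable_timestamp(bool in_irq_context)
{
	if (in_irq_context) {
		/* Sleeping is not allowed here: only record the request. */
		atomic_fetch_add(&deferred_decs, 1);
		return;
	}
	flush_deferred();
	slow_dec();
}

int model_key_refs(void) { return atomic_load(&key_refs); }
int model_pending(void) { return atomic_load(&deferred_decs); }
```

In this model a disable issued from interrupt context leaves the reference count untouched and only bumps the pending counter; the next call made from process context performs the postponed decrement.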
    • net: use skb_flow_dissect() in __skb_get_rxhash() · 4504b861
      Committed by Eric Dumazet
      No functional changes.
      
      This uses the code we factorized in skb_flow_dissect()
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce skb_flow_dissect() · 0744dd00
      Committed by Eric Dumazet
      We use at least two flow dissectors in the network stack, with known
      limitations and code duplication.
      
      Introduce skb_flow_dissect() to factorize this, heavily inspired by the
      existing dissector in __skb_get_rxhash().
      
      Note : We extensively use skb_header_pointer(), this permits us to not
      touch skb at all.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 26 Nov, 2011 1 commit
  3. 24 Nov, 2011 1 commit
  4. 23 Nov, 2011 3 commits
  5. 19 Nov, 2011 1 commit
    • net: Remove all uses of LL_ALLOCATED_SPACE · ae641949
      Committed by Herbert Xu
      The macro LL_ALLOCATED_SPACE was ill-conceived.  It applies the
      alignment to the sum of needed_headroom and needed_tailroom.  As
      the amount that is then reserved for head room is needed_headroom
      with alignment, this means that the tail room left may be too small.
      
      This patch replaces all uses of LL_ALLOCATED_SPACE with the macro
      LL_RESERVED_SPACE and direct reference to needed_tailroom.
      
      This also fixes the problem with needed_headroom changing between
      allocating the skb and reserving the head room.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
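The arithmetic behind this commit can be checked with a small user-space sketch. The two helper functions reproduce the macro definitions as recalled from include/linux/netdevice.h of that era, which should be treated as an assumption; `struct mini_dev`, `old_tail_room()` and `new_total_space()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>

#define HH_DATA_MOD 16

/* Hypothetical miniature of struct net_device: only the fields used here. */
struct mini_dev {
	unsigned int hard_header_len;
	unsigned int needed_headroom;
	unsigned int needed_tailroom;
};

/* Head room reserved: header + headroom, rounded with HH_DATA_MOD. */
unsigned int ll_reserved_space(const struct mini_dev *dev)
{
	return ((dev->hard_header_len + dev->needed_headroom) &
		~(HH_DATA_MOD - 1u)) + HH_DATA_MOD;
}

/* Old macro: the alignment is applied to the sum, tail room included. */
unsigned int ll_allocated_space(const struct mini_dev *dev)
{
	return ((dev->hard_header_len + dev->needed_headroom +
		 dev->needed_tailroom) & ~(HH_DATA_MOD - 1u)) + HH_DATA_MOD;
}

/* Link-layer tail room left once the head room is reserved (old scheme). */
unsigned int old_tail_room(const struct mini_dev *dev)
{
	return ll_allocated_space(dev) - ll_reserved_space(dev);
}

/* Replacement scheme: reserved head room plus needed_tailroom, explicitly. */
unsigned int new_total_space(const struct mini_dev *dev)
{
	return ll_reserved_space(dev) + dev->needed_tailroom;
}
```

With a 16-byte header, no extra head room and 12 bytes of needed tail room, the old scheme allocates 32 bytes and reserves all 32 as head room, leaving no tail room at all; the replacement allocates 44 bytes.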
  6. 18 Nov, 2011 1 commit
    • net: use jump_label to shortcut RPS if not setup · adc9300e
      Committed by Eric Dumazet
      Most machines don't use RPS/RFS, yet spend a fair number of instructions in
      netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover that
      RPS/RFS is not set up.
      
      Add a jump_label named rps_needed.
      
      If no device rps_map or global rps_sock_flow_table is set up,
      netif_receive_skb() / netif_rx() execute a single instruction instead of
      many, including conditional jumps.
      
      jmp +0    (if CONFIG_JUMP_LABEL=y)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 17 Nov, 2011 10 commits
  8. 15 Nov, 2011 1 commit
    • net: introduce build_skb() · b2b5ce9d
      Committed by Eric Dumazet
      One of the things we discussed during netdev 2011 conference was the idea
      to change some network drivers to allocate/populate their skb at RX
      completion time, right before feeding the skb to network stack.
      
      In old days, we allocated skbs when populating the RX ring.
      
      This means bringing into cpu cache sk_buff and skb_shared_info cache
      lines (since we clear/initialize them), then 'queue' skb->data to NIC.
      
      By the time the NIC fills a frame in the skb->data buffer and the host can
      process it, the cpu has probably evicted those cache lines, because a lot
      of things happened between the allocation and the final use.
      
      So the deal would be to allocate only the data buffer for the NIC to
      populate its RX ring buffer. And use build_skb() at RX completion to
      attach a data buffer (now filled with an ethernet frame) to a new skb,
      initialize the skb_shared_info portion, and give the hot skb to network
      stack.
      
      build_skb() is the function to allocate an skb, caller providing the
      data buffer that should be attached to it. Drivers are expected to call
      skb_reserve() right after build_skb() to adjust skb->data to the
      Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
      drivers might add a hardware provided alignment)
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting
      incoming frame.
      
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
      	       GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
      	recycle_data(data);
      } else {
      	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
      	...
      }
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 14 Nov, 2011 1 commit
    • neigh: new unresolved queue limits · 8b5c171b
      Committed by Eric Dumazet
      On Wednesday, 9 November 2011 at 16:21 -0500, David Miller wrote:
      > From: David Miller <davem@davemloft.net>
      > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
      >
      > > From: Eric Dumazet <eric.dumazet@gmail.com>
      > > Date: Wed, 09 Nov 2011 12:14:09 +0100
      > >
      > >> unres_qlen is the number of frames we are able to queue per unresolved
      > >> neighbour. Its default value (3) was never changed and is responsible
      > >> for strange drops, especially if IP fragments are used, or multiple
      > >> sessions start in parallel. Even a single tcp flow can hit this limit.
      > >  ...
      > >
      > > Ok, I've applied this, let's see what happens :-)
      >
      > Early answer, build fails.
      >
      > Please test build this patch with DECNET enabled and resubmit.  The
      > decnet neigh layer still refers to the removed ->queue_len member.
      >
      > Thanks.
      
      Ouch, this was fixed on one machine yesterday, but not the other one I
      used this morning, sorry.
      
      [PATCH V5 net-next] neigh: new unresolved queue limits
      
      unres_qlen is the number of frames we are able to queue per unresolved
      neighbour. Its default value (3) was never changed and is responsible
      for strange drops, especially if IP fragments are used, or multiple
      sessions start in parallel. Even a single tcp flow can hit this limit.
      
      $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
      PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
      8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 10 Nov, 2011 1 commit
    • net: add wireless TX status socket option · 6e3e939f
      Committed by Johannes Berg
      The 802.1X EAPOL handshake that hostapd performs requires
      knowing whether the frame was ack'ed by the peer.
      Currently, we fudge this pretty badly by not even
      transmitting the frame as a normal data frame but
      injecting it with radiotap and getting the status
      out of radiotap monitor as well. This is rather
      complex, confuses users (mon.wlan0 presence) and
      doesn't work with all hardware.
      
      To get rid of that hack, introduce a real wifi TX
      status option for data frame transmissions.
      
      This works similar to the existing TX timestamping
      in that it reflects the SKB back to the socket's
      error queue with a SCM_WIFI_STATUS cmsg that has
      an int indicating ACK status (0/1).
      
      Since it is possible that at some point we will
      want to have TX timestamping and wifi status in a
      single errqueue SKB (there's little point in not
      doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
      to SO_EE_ORIGIN_TXSTATUS which can collect more
      than just the timestamp; keep the old constant
      as an alias of course. Currently the internal APIs
      don't make that possible, but it wouldn't be hard
      to split them up in a way that makes it possible.
      
      Thanks to Neil Horman for helping me figure out
      the functions that add the control messages.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
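A user-space reader of this status could scan the control messages of an error-queue message along these lines. This is a minimal sketch: `find_wifi_ack()` is a hypothetical helper, and the fallback value 41 for `SO_WIFI_STATUS` is the asm-generic number, assumed here in case the libc headers do not define it (some architectures use a different value).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: 41 is the asm-generic value; some architectures differ. */
#ifndef SO_WIFI_STATUS
#define SO_WIFI_STATUS 41
#endif
#ifndef SCM_WIFI_STATUS
#define SCM_WIFI_STATUS SO_WIFI_STATUS
#endif

/*
 * Scan the control messages of a message obtained with
 * recvmsg(fd, msg, MSG_ERRQUEUE). Returns 1 and stores the ACK flag
 * (0 or 1) in *acked if a wifi status cmsg is present, 0 otherwise.
 */
int find_wifi_ack(struct msghdr *msg, int *acked)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_WIFI_STATUS &&
		    cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
			/* Copy out rather than cast: CMSG_DATA may be unaligned. */
			memcpy(acked, CMSG_DATA(cmsg), sizeof(int));
			return 1;
		}
	}
	return 0;
}
```

The sender would first opt in with setsockopt() at level SOL_SOCKET using SO_WIFI_STATUS, then poll the socket's error queue after transmitting and pass each received message to the helper.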
  11. 09 Nov, 2011 1 commit
  12. 04 Nov, 2011 1 commit
    • net: Add back alignment for size for __alloc_skb · bc417e30
      Committed by Tony Lindgren
      Commit 87fb4b7b (net: more accurate skb truesize) changed the
      alignment of size. This can cause problems at least on some
      machines with NFS root:
      
      Unhandled fault: alignment exception (0x801) at 0xc183a43a
      Internal error: : 801 [#1] PREEMPT
      Modules linked in:
      CPU: 0    Not tainted  (3.1.0-08784-g5eeee4a #733)
      pc : [<c02fbba0>]    lr : [<c02fbb9c>]    psr: 60000013
      sp : c180fef8  ip : 00000000  fp : c181f580
      r10: 00000000  r9 : c044b28c  r8 : 00000001
      r7 : c183a3a0  r6 : c1835be0  r5 : c183a412  r4 : 000001f2
      r3 : 00000000  r2 : 00000000  r1 : ffffffe6  r0 : c183a43a
      Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
      Control: 0005317f  Table: 10004000  DAC: 00000017
      Process swapper (pid: 1, stack limit = 0xc180e270)
      Stack: (0xc180fef8 to 0xc1810000)
      fee0:                                                       00000024 00000000
      ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
      ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
      ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
      ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
      ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
      ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
      ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
      ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
      Function entered at [<c02fbba0>] from [<c019dfc4>]
      Function entered at [<c019dfc4>] from [<c01f34f4>]
      Function entered at [<c01f34f4>] from [<c01f4918>]
      Function entered at [<c01f4918>] from [<c0008758>]
      Function entered at [<c0008758>] from [<c04b2224>]
      Function entered at [<c04b2224>] from [<c001056c>]
      Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)
      
      Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
      skb->end can be unaligned without this patch.
      
      As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens
      only with SLOB, and not with SLAB or SLUB:
      
      * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]:
      >
      > Your patch is absolutely needed, I completely forgot about SLOB :(
      >
      > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
      > power of two.
      >
      > [   60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
      > [   60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
      > [   60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
      > [   60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
      > [   60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
      > [   60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
      > [   60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
      > [   60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
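The numbers in the quoted log can be reproduced with a small model. It is only a sketch: the 64-byte cache line is an assumption, the 384-byte size of the aligned skb_shared_info is inferred from the log itself (386 -> 2 and 656 -> 272), and the `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration; the shinfo size is inferred from the log. */
#define MODEL_CACHE_BYTES 64u
#define MODEL_SHINFO_SIZE 384u

/* Round x up to a multiple of a (a must be a power of two). */
size_t model_align(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Offset of skb->end from skb->head when the allocator hands back exactly
 * ksize usable bytes, as SLOB does according to the quoted explanation.
 */
size_t model_end_offset(size_t ksize)
{
	return ksize - MODEL_SHINFO_SIZE;
}
```

With ksize=386 the shared info would start at offset 2, which is not word aligned. Rounding the requested size up to a cache-line multiple before allocating gives 448 bytes, so the shared info starts at offset 64.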
  13. 02 Nov, 2011 1 commit
  14. 01 Nov, 2011 2 commits
  15. 30 Oct, 2011 1 commit
    • vlan: allow nested vlan_do_receive() · 6a32e4f9
      Committed by Eric Dumazet
      commit 2425717b (net: allow vlan traffic to be received under bond)
      broke ARP processing on vlan on top of bonding.
      
             +-------+
      eth0 --| bond0 |---bond0.103
      eth1 --|       |
             +-------+
      
      52870.115435: skb_gro_reset_offset <-napi_gro_receive
      52870.115435: dev_gro_receive <-napi_gro_receive
      52870.115435: napi_skb_finish <-napi_gro_receive
      52870.115435: netif_receive_skb <-napi_skb_finish
      52870.115435: get_rps_cpu <-netif_receive_skb
      52870.115435: __netif_receive_skb <-netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: bond_handle_frame <-__netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: arp_rcv <-__netif_receive_skb
      52870.115436: kfree_skb <-arp_rcv
      
      Packet is dropped in arp_rcv() because its pkt_type was set to
      PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103
      exists.
      
      We really need to change pkt_type only if no more rx_handler is about to
      be called for the packet.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
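The rule stated at the end of this commit message can be illustrated with a toy model of the eth0 / bond0 / bond0.103 receive path. Everything here is hypothetical (the `model_*` names, the struct fields and the `last_handler` flag); it is a sketch of the idea, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

enum model_pkt_type { MODEL_PACKET_HOST, MODEL_PACKET_OTHERHOST };

struct model_pkt {
	int vlan_id;                  /* VLAN tag carried by the frame */
	bool vlan_dev_on_current_dev; /* false on eth0, true once on bond0 */
	enum model_pkt_type pkt_type;
};

/*
 * Returns true if a VLAN device claims the packet. When none matches, the
 * packet is marked OTHERHOST only if no further rx_handler will see it.
 */
bool model_vlan_do_receive(struct model_pkt *pkt, bool last_handler)
{
	if (pkt->vlan_dev_on_current_dev)
		return true;
	if (pkt->vlan_id && last_handler)
		pkt->pkt_type = MODEL_PACKET_OTHERHOST;
	return false;
}

/* Receive path for a tagged frame arriving on a bonding slave. */
bool model_receive(struct model_pkt *pkt)
{
	/* First pass on eth0: the bonding rx_handler is still pending. */
	if (model_vlan_do_receive(pkt, false))
		return true;
	/* The bonding rx_handler re-targets the packet to bond0. */
	pkt->vlan_dev_on_current_dev = true;
	/* Second pass on bond0: no rx_handler remains. */
	return model_vlan_do_receive(pkt, true);
}
```

In this model a frame tagged 103 that arrives on eth0 keeps its host packet type through the first pass and is then claimed on bond0, whereas a frame with no matching VLAN device and no pending rx_handler is still marked OTHERHOST.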
  16. 26 Oct, 2011 1 commit
  17. 24 Oct, 2011 2 commits
  18. 21 Oct, 2011 4 commits
  19. 20 Oct, 2011 4 commits
    • net: do not take an additional reference in skb_frag_set_page · a0bec1cd
      Committed by Ian Campbell
      I audited all of the callers in the tree and only one of them (pktgen) expects
      it to do so. Taking this reference is pretty obviously confusing and error
      prone.
      
      In particular I looked at the following commits which switched callers of
      (__)skb_frag_set_page to the skb paged fragment api:
      
      6a930b9f cxgb3: convert to SKB paged frag API.
      5dc3e196 myri10ge: convert to SKB paged frag API.
      0e0634d2 vmxnet3: convert to SKB paged frag API.
      86ee8130 virtionet: convert to SKB paged frag API.
      4a22c4c9 sfc: convert to SKB paged frag API.
      18324d69 cassini: convert to SKB paged frag API.
      b061b39e benet: convert to SKB paged frag API.
      b7b6a688 bnx2: convert to SKB paged frag API.
      804cf14e net: xfrm: convert to SKB frag APIs
      ea2ab693 net: convert core to skb paged frag APIs
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neigh: fix rcu splat in neigh_update() · e049f288
      Committed by roy.qing.li@gmail.com
      When dst_get_neighbour() is used to get a neighbour, the call must be
      protected by rcu_read_lock(), since dst_get_neighbour() uses
      rcu_dereference().
      
      The bug was reported by Ari Savolainen <ari.m.savolainen@gmail.com>
      
      [  105.612095]
      [  105.612096] ===================================================
      [  105.612100] [ INFO: suspicious rcu_dereference_check() usage. ]
      [  105.612101] ---------------------------------------------------
      [  105.612103] include/net/dst.h:91 invoked rcu_dereference_check() without protection!
      [  105.612105]
      [  105.612106] other info that might help us debug this:
      [  105.612106]
      [  105.612108]
      [  105.612108] rcu_scheduler_active = 1, debug_locks = 0
      [  105.612110] 1 lock held by dnsmasq/2618:
      [  105.612111]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815df8c7>] rtnl_lock+0x17/0x20
      [  105.612120]
      [  105.612121] stack backtrace:
      [  105.612123] Pid: 2618, comm: dnsmasq Not tainted 3.1.0-rc1 #41
      [  105.612125] Call Trace:
      [  105.612129]  [<ffffffff810ccdcb>] lockdep_rcu_dereference+0xbb/0xc0
      [  105.612132]  [<ffffffff815dc5a9>] neigh_update+0x4f9/0x5f0
      [  105.612135]  [<ffffffff815da001>] ? neigh_lookup+0xe1/0x220
      [  105.612139]  [<ffffffff81639298>] arp_req_set+0xb8/0x230
      [  105.612142]  [<ffffffff8163a59f>] arp_ioctl+0x1bf/0x310
      [  105.612146]  [<ffffffff810baa40>] ? lock_hrtimer_base.isra.26+0x30/0x60
      [  105.612150]  [<ffffffff8163fb75>] inet_ioctl+0x85/0x90
      [  105.612154]  [<ffffffff815b5520>] sock_do_ioctl+0x30/0x70
      [  105.612157]  [<ffffffff815b55d3>] sock_ioctl+0x73/0x280
      [  105.612162]  [<ffffffff811b7698>] do_vfs_ioctl+0x98/0x570
      [  105.612165]  [<ffffffff811a5c40>] ? fget_light+0x340/0x3a0
      [  105.612168]  [<ffffffff811b7bbf>] sys_ioctl+0x4f/0x80
      [  105.612172]  [<ffffffff816fdcab>] system_call_fastpath+0x16/0x1b
      Reported-by: Ari Savolainen <ari.m.savolainen@gmail.com>
      Signed-off-by: RongQing <roy.qing.li@gmail.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • filter: use unsigned int to silence static checker warning · 4f25af27
      Committed by Dan Carpenter
      This is just a cleanup.
      
      My testing version of Smatch warns about this:
      net/core/filter.c +380 check_load_and_stores(6)
      	warn: check 'flen' for negative values
      
      flen comes from the user.  We try to clamp the values here between 1
      and BPF_MAXINSNS but the clamp doesn't work because it could be
      negative.  This is a bug, but it's not exploitable.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
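Why the clamp fails for a negative length can be seen in a few lines of C. This is a sketch assuming a range check of the form "reject if zero or greater than the maximum"; the function names and `MODEL_BPF_MAXINSNS` (set to 4096 here) are hypothetical and stand in for the kernel's own check.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_BPF_MAXINSNS 4096

/* Signed length: a negative value slips past both comparisons. */
bool flen_ok_signed(int flen)
{
	return !(flen == 0 || flen > MODEL_BPF_MAXINSNS);
}

/* Unsigned length: the same bit pattern is a huge value and is rejected. */
bool flen_ok_unsigned(unsigned int flen)
{
	return !(flen == 0 || flen > MODEL_BPF_MAXINSNS);
}
```

Passing -1 to the signed variant is accepted, while the unsigned variant sees the maximum unsigned value and rejects it. Lengths from 1 to the maximum are accepted by both.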
    • fib_rules: fix unresolved_rules counting · afaef734
      Committed by Yan, Zheng
      We should decrement ops->unresolved_rules when deleting an unresolved rule.
      Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>