1. 15 11月, 2011 1 次提交
    • E
      net: introduce build_skb() · b2b5ce9d
      Eric Dumazet 提交于
      One of the thing we discussed during netdev 2011 conference was the idea
      to change some network drivers to allocate/populate their skb at RX
      completion time, right before feeding the skb to network stack.
      
      In old days, we allocated skbs when populating the RX ring.
      
      This means bringing into cpu cache sk_buff and skb_shared_info cache
      lines (since we clear/initialize them), then 'queue' skb->data to NIC.
      
      By the time NIC fills a frame in skb->data buffer and host can process
      it, cpu probably threw away the cache lines from its caches, because lot
      of things happened between the allocation and final use.
      
      So the deal would be to allocate only the data buffer for the NIC to
      populate its RX ring buffer. And use build_skb() at RX completion to
      attach a data buffer (now filled with an ethernet frame) to a new skb,
      initialize the skb_shared_info portion, and give the hot skb to network
      stack.
      
      build_skb() is the function to allocate an skb, caller providing the
      data buffer that should be attached to it. Drivers are expected to call
      skb_reserve() right after build_skb() to adjust skb->data to the
      Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
      drivers might add a hardware provided alignment)
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting
      incoming frame.
      
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
      	       GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
      	recycle_data(data);
      } else {
      	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
      	...
      }
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2b5ce9d
  2. 14 11月, 2011 8 次提交
  3. 10 11月, 2011 2 次提交
    • E
      ipv4: PKTINFO doesnt need dst reference · d826eb14
      Eric Dumazet 提交于
      Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :
      
      > At least, in recent kernels we dont change dst->refcnt in forwarding
      > patch (usinf NOREF skb->dst)
      >
      > One particular point is the atomic_inc(dst->refcnt) we have to perform
      > when queuing an UDP packet if socket asked PKTINFO stuff (for example a
      > typical DNS server has to setup this option)
      >
      > I have one patch somewhere that stores the information in skb->cb[] and
      > avoid the atomic_{inc|dec}(dst->refcnt).
      >
      
      OK I found it, I did some extra tests and believe its ready.
      
      [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
      
      When a socket uses IP_PKTINFO notifications, we currently force a dst
      reference for each received skb. Reader has to access dst to get needed
      information (rt_iif & rt_spec_dst) and must release dst reference.
      
      We also forced a dst reference if skb was put in socket backlog, even
      without IP_PKTINFO handling. This happens under stress/load.
      
      We can instead store the needed information in skb->cb[], so that only
      softirq handler really access dst, improving cache hit ratios.
      
      This removes two atomic operations per packet, and false sharing as
      well.
      
      On a benchmark using a mono threaded receiver (doing only recvmsg()
      calls), I can reach 720.000 pps instead of 570.000 pps.
      
      IP_PKTINFO is typically used by DNS servers, and any multihomed aware
      UDP application.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d826eb14
    • E
      ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3
      Eric Dumazet 提交于
      Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
      (can be ~88000 us).
      
      This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
      for all possible cpus can read 16 Mbytes of memory.
      
      ICMP messages are not considered as fast path on a typical server, and
      eventually few cpus handle them anyway. We can afford an atomic
      operation instead of using percpu data.
      
      This saves 4096 bytes per cpu and per network namespace.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acb32ba3
  4. 09 11月, 2011 11 次提交
  5. 08 11月, 2011 1 次提交
  6. 04 11月, 2011 4 次提交
    • O
      af_packet: de-inline some helper functions · eea49cc9
      Olof Johansson 提交于
      This popped some compiler errors due to mismatched prototypes. Just
      remove most manual inlines, the compiler should be able to figure out
      what makes sense to inline and not.
      
      net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
      net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
      net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
      net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
      net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
      net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
      net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
      net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      Cc: Chetan Loke <loke.chetan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eea49cc9
    • T
      net: Add back alignment for size for __alloc_skb · bc417e30
      Tony Lindgren 提交于
      Commit 87fb4b7b (net: more
      accurate skb truesize) changed the alignment of size. This
      can cause problems at least on some machines with NFS root:
      
      Unhandled fault: alignment exception (0x801) at 0xc183a43a
      Internal error: : 801 [#1] PREEMPT
      Modules linked in:
      CPU: 0    Not tainted  (3.1.0-08784-g5eeee4a #733)
      pc : [<c02fbba0>]    lr : [<c02fbb9c>]    psr: 60000013
      sp : c180fef8  ip : 00000000  fp : c181f580
      r10: 00000000  r9 : c044b28c  r8 : 00000001
      r7 : c183a3a0  r6 : c1835be0  r5 : c183a412  r4 : 000001f2
      r3 : 00000000  r2 : 00000000  r1 : ffffffe6  r0 : c183a43a
      Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
      Control: 0005317f  Table: 10004000  DAC: 00000017
      Process swapper (pid: 1, stack limit = 0xc180e270)
      Stack: (0xc180fef8 to 0xc1810000)
      fee0:                                                       00000024 00000000
      ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
      ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
      ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
      ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
      ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
      ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
      ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
      ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
      Function entered at [<c02fbba0>] from [<c019dfc4>]
      Function entered at [<c019dfc4>] from [<c01f34f4>]
      Function entered at [<c01f34f4>] from [<c01f4918>]
      Function entered at [<c01f4918>] from [<c0008758>]
      Function entered at [<c0008758>] from [<c04b2224>]
      Function entered at [<c04b2224>] from [<c001056c>]
      Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)
      
      Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
      skb->end can be unaligned without this patch.
      
      As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens
      only with SLOB, and not with SLAB or SLUB:
      
      * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]:
      >
      > Your patch is absolutely needed, I completely forgot about SLOB :(
      >
      > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
      > power of two.
      >
      > [   60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
      > [   60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
      > [   60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
      > [   60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
      > [   60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
      > [   60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
      > [   60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
      > [   60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc417e30
    • E
      net: add missing bh_unlock_sock() calls · 918eb399
      Eric Dumazet 提交于
      Simon Kirby reported lockdep warnings and following messages :
      
      [104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
      preempt_count 00000101, exited with 00000102?
      
      [104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
      preempt_count 00000101, exited with 00000102?
      
      Problem comes from commit 0e734419
      (ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)
      
      If inet_csk_route_child_sock() returns NULL, we should release socket
      lock before freeing it.
      
      Another lock imbalance exists if __inet_inherit_port() returns an error
      since commit 093d2823 ( tproxy: fix hash locking issue when using
      port redirection in __inet_inherit_port()) a backport is also needed for
      >= 2.6.37 kernels.
      Reported-by: NSimon Kirby <sim@hostway.ca>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Balazs Scheidler <bazsi@balabit.hu>
      CC: KOVACS Krisztian <hidden@balabit.hu>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NSimon Kirby <sim@hostway.ca>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      918eb399
    • E
      l2tp: fix race in l2tp_recv_dequeue() · e2e210c0
      Eric Dumazet 提交于
      Misha Labjuk reported panics occurring in l2tp_recv_dequeue()
      
      If we release reorder_q.lock, we must not keep a dangling pointer (tmp),
      since another thread could manipulate reorder_q.
      
      Instead we must restart the scan at beginning of list.
      Reported-by: NMisha Labjuk <spiked.yar@gmail.com>
      Tested-by: NMisha Labjuk <spiked.yar@gmail.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2e210c0
  7. 03 11月, 2011 4 次提交
  8. 02 11月, 2011 5 次提交
    • E
      udp: fix a race in encap_rcv handling · 0ad92ad0
      Eric Dumazet 提交于
      udp_queue_rcv_skb() has a possible race in encap_rcv handling, since
      this pointer can be changed anytime.
      
      We should use ACCESS_ONCE() to close the race.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ad92ad0
    • D
      x25: Fix NULL dereference in x25_recvmsg · 501e89d3
      Dave Jones 提交于
      commit cb101ed2 in 3.0 introduced a bug in x25_recvmsg()
      When passed bogus junk from userspace, x25->neighbour can be NULL,
      as shown in this oops..
      
      BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      IP: [<ffffffffa05482bd>] x25_recvmsg+0x4d/0x280 [x25]
      PGD 1015f3067 PUD 105072067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU 0
      Pid: 27928, comm: iknowthis Not tainted 3.1.0+ #2 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
      RIP: 0010:[<ffffffffa05482bd>]  [<ffffffffa05482bd>] x25_recvmsg+0x4d/0x280 [x25]
      RSP: 0018:ffff88010c0b7cc8  EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff88010c0b7d78 RCX: 0000000000000c02
      RDX: ffff88010c0b7d78 RSI: ffff88011c93dc00 RDI: ffff880103f667b0
      RBP: ffff88010c0b7d18 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff880103f667b0
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f479ce7f700(0000) GS:ffff88012a600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 000000000000001c CR3: 000000010529e000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process iknowthis (pid: 27928, threadinfo ffff88010c0b6000, task ffff880103faa4f0)
      Stack:
       0000000000000c02 0000000000000c02 ffff88010c0b7d18 ffffff958153cb37
       ffffffff8153cb60 0000000000000c02 ffff88011c93dc00 0000000000000000
       0000000000000c02 ffff88010c0b7e10 ffff88010c0b7de8 ffffffff815372c2
      Call Trace:
       [<ffffffff8153cb60>] ? sock_update_classid+0xb0/0x180
       [<ffffffff815372c2>] sock_aio_read.part.10+0x142/0x150
       [<ffffffff812d6752>] ? inode_has_perm+0x62/0xa0
       [<ffffffff815372fd>] sock_aio_read+0x2d/0x40
       [<ffffffff811b05e2>] do_sync_read+0xd2/0x110
       [<ffffffff812d3796>] ? security_file_permission+0x96/0xb0
       [<ffffffff811b0a91>] ? rw_verify_area+0x61/0x100
       [<ffffffff811b103d>] vfs_read+0x16d/0x180
       [<ffffffff811b109d>] sys_read+0x4d/0x90
       [<ffffffff81657282>] system_call_fastpath+0x16/0x1b
      Code: 8b 66 20 4c 8b 32 48 89 d3 48 89 4d b8 45 89 c7 c7 45 cc 95 ff ff ff 4d 85 e4 0f 84 ed 01 00 00 49 8b 84 24 18 05 00 00 4c 89 e7
       78 1c 01 45 19 ed 31 f6 e8 d5 37 ff e0 41 0f b6 44 24 0e 41
      Signed-off-by: NDave Jones <davej@redhat.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      501e89d3
    • A
      net: make the tcp and udp file_operations for the /proc stuff const · 73cb88ec
      Arjan van de Ven 提交于
      the tcp and udp code creates a set of struct file_operations at runtime
      while it can also be done at compile time, with the added benefit of then
      having these file operations be const.
      
      the trickiest part was to get the "THIS_MODULE" reference right; the naive
      method of declaring a struct in the place of registration would not work
      for this reason.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73cb88ec
    • M
      vlan: Don't propagate flag changes on down interfaces. · deede2fa
      Matthijs Kooijman 提交于
      When (de)configuring a vlan interface, the IFF_ALLMULTI ans IFF_PROMISC
      flags are cleared or set on the underlying interface. So, if these flags
      are changed on a vlan interface that is not up, the flags underlying
      interface might be set or cleared twice.
      
      Only propagating flag changes when a device is up makes sure this does
      not happen. It also makes sure that an underlying device is not set to
      promiscuous or allmulti mode for a vlan device that is down.
      Signed-off-by: NMatthijs Kooijman <matthijs@stdin.nl>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      deede2fa
    • D
      neigh: Kill bogus SMP protected debugging message. · 045f7b3b
      David S. Miller 提交于
      Whatever situations make this state legitimate when SMP
      also would be legitimate when !SMP and f.e. preemption is
      enabled.
      
      This is dubious enough that we should just delete it entirely.  If we
      want to add debugging for neigh timer races, better more thorough
      mechanisms are needed.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      045f7b3b
  9. 01 11月, 2011 4 次提交