1. 01 May, 2012 1 commit
    • net: make GRO aware of skb->head_frag · d7e8883c
      Eric Dumazet committed
      GRO can check if skb to be merged has its skb->head mapped to a page
      fragment, instead of a kmalloc() area.
      
      We 'upgrade' skb->head as a fragment in itself.
      
      This avoids the frag_list fallback, and permits building a true GRO skb
      (one sk_buff and up to 16 fragments), using less memory.
      
      This reduces the number of cache misses when the user makes its copy, since a
      single sk_buff is fetched.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
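The merge decision described in this commit can be pictured with a small userspace model. This is only a sketch under stated assumptions: `demo_skb`, `DEMO_MAX_FRAGS` and `demo_gro_merge()` are hypothetical names, not the kernel's `struct sk_buff` or `skb_gro_receive()`, and the real code tracks pages, offsets and truesize as well.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_FRAGS 16 /* "up to 16 fragments", per the commit message */

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct demo_skb {
    bool head_frag;             /* true if the head lives in a page fragment */
    unsigned int nr_frags;      /* page fragments already attached */
    unsigned int frag_list_len; /* skbs chained via the frag_list fallback */
};

enum demo_merge { DEMO_MERGED_AS_FRAG, DEMO_MERGED_ON_FRAG_LIST };

/* Merge 'skb' into the GRO head 'p'. */
static enum demo_merge demo_gro_merge(struct demo_skb *p,
                                      const struct demo_skb *skb)
{
    /* One slot for the "upgraded" head plus the frags skb already owns. */
    unsigned int needed = 1 + skb->nr_frags;

    if (skb->head_frag && p->nr_frags + needed <= DEMO_MAX_FRAGS) {
        p->nr_frags += needed;
        return DEMO_MERGED_AS_FRAG;
    }
    /* Assumed fallback: a kmalloc()ed head, or no free slots left. */
    p->frag_list_len++;
    return DEMO_MERGED_ON_FRAG_LIST;
}
```

In this model a page-fragment head costs one frag slot in the aggregated skb, while a kmalloc() head still ends up on the frag_list.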
  2. 20 April, 2012 1 commit
  3. 16 April, 2012 1 commit
  4. 13 April, 2012 1 commit
  5. 04 April, 2012 1 commit
  6. 29 March, 2012 1 commit
  7. 28 March, 2012 1 commit
    • net/core: dev_forward_skb() should clear skb_iif · 3b9785c6
      Benjamin LaHaise committed
      While investigating another bug, I found that the code on the incoming path
      in __netif_receive_skb will only set skb->skb_iif if it is already 0.  When
      dev_forward_skb() is used in the case of interfaces like veth, skb_iif may
      already have been set.  Making dev_forward_skb() cause the packet to look
      like a newly received packet would seem to be the correct behaviour here,
      as otherwise the wrong incoming interface can be reported for such a packet.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
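The behaviour described in this commit can be reproduced in a few lines of userspace C. The sketch below is a model only: `demo_skb`, `demo_receive()` and `demo_forward()` are hypothetical names, and the `clear_iif` flag exists just to compare the old and new behaviour side by side.

```c
#include <assert.h>

/* Hypothetical stand-in for the one sk_buff field of interest. */
struct demo_skb {
    int skb_iif; /* incoming interface index, 0 means "not set yet" */
};

/* Receive path: records the incoming interface only if none is set. */
static void demo_receive(struct demo_skb *skb, int ifindex)
{
    if (!skb->skb_iif)
        skb->skb_iif = ifindex;
}

/*
 * Forward to a peer device (the veth case). With clear_iif set, the
 * packet looks newly received, so the peer's index gets recorded.
 */
static void demo_forward(struct demo_skb *skb, int peer_ifindex, int clear_iif)
{
    if (clear_iif)
        skb->skb_iif = 0;
    demo_receive(skb, peer_ifindex);
}
```

Without the clearing step, a packet that already carries interface index 3 keeps reporting 3 after being forwarded to interface 7.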
  8. 22 March, 2012 1 commit
  9. 07 March, 2012 1 commit
  10. 06 March, 2012 1 commit
  11. 24 February, 2012 1 commit
    • static keys: Introduce 'struct static_key', static_key_true()/false() and... · c5905afb
      Ingo Molnar committed
      static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed through the rename
      blindly through to the lowest levels: the actual jump-label
      patching arch facility should be named like that, so we want to
      decouple jump labels from the static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
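The last point of this commit message, that static keys fall back to likely()/unlikely() branches where jump labels are unavailable, can be modelled in userspace. This is a minimal sketch, not the kernel implementation: every `demo_` name is hypothetical, the counter is a plain C11 atomic, and `__builtin_expect` assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Branch hints, assuming a GCC/Clang-compatible compiler. */
#define demo_likely(x)   __builtin_expect(!!(x), 1)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical model of a static key: a reference count of enablers. */
struct demo_static_key {
    atomic_int enabled;
};

#define DEMO_STATIC_KEY_INIT_FALSE { 0 }

/* Branch expected to be off most of the time. */
static inline bool demo_static_key_false(struct demo_static_key *key)
{
    return demo_unlikely(atomic_load(&key->enabled) > 0);
}

/* Branch expected to be on most of the time. */
static inline bool demo_static_key_true(struct demo_static_key *key)
{
    return demo_likely(atomic_load(&key->enabled) > 0);
}

/* "Slow" in the kernel because code is patched; here just a counter. */
static inline void demo_static_key_slow_inc(struct demo_static_key *key)
{
    atomic_fetch_add(&key->enabled, 1);
}

static inline void demo_static_key_slow_dec(struct demo_static_key *key)
{
    atomic_fetch_sub(&key->enabled, 1);
}
```

Both test helpers return the same truth value; they differ only in the hint given to the compiler about which way the branch usually goes.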
  12. 09 February, 2012 2 commits
    • gro: more generic L2 header check · 5ca3b72c
      Eric Dumazet committed
      Shlomo Pongratz reported GRO L2 header check was suited for Ethernet
      only, and failed on IB/ipoib traffic.
      
      He provided a patch faking a zeroed header to let GRO aggregate frames.
      
      Roland Dreier, Herbert Xu, and others suggested we change GRO L2 header
      check to be more generic, i.e. not assuming L2 header is 14 bytes, but
      taking into account hard_header_len.
      
      __napi_gro_receive() has special handling for the common case (Ethernet)
      to avoid a memcmp() call and use an inline optimized function instead.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Tested-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
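The check described in this commit, a generic comparison over hard_header_len bytes with a fast path for the 14-byte Ethernet case, might look roughly like the userspace sketch below. The function names and `DEMO_ETH_HLEN` are hypothetical, and the fixed-width loads are only one possible way to avoid a memcmp() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Fast path: compare two 14-byte headers as 8 + 4 + 2 byte words. */
static bool demo_ether_header_differs(const uint8_t *a, const uint8_t *b)
{
    uint64_t a8, b8;
    uint32_t a4, b4;
    uint16_t a2, b2;

    /* Fixed-size memcpy() keeps the loads alignment-safe. */
    memcpy(&a8, a, 8);      memcpy(&b8, b, 8);
    memcpy(&a4, a + 8, 4);  memcpy(&b4, b + 8, 4);
    memcpy(&a2, a + 12, 2); memcpy(&b2, b + 12, 2);

    return ((a8 ^ b8) | (a4 ^ b4) | (uint32_t)(a2 ^ b2)) != 0;
}

/* Generic check: honours the device's L2 header length. */
static bool demo_l2_header_differs(const uint8_t *a, const uint8_t *b,
                                   size_t hard_header_len)
{
    if (hard_header_len == DEMO_ETH_HLEN)
        return demo_ether_header_differs(a, b);
    return memcmp(a, b, hard_header_len) != 0;
}
```

A difference beyond byte 14 is invisible to the Ethernet path but is caught once a longer header length is passed in, which is the situation the report describes for non-Ethernet links.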
    • gro: more generic L2 header check · 43480aec
      Eric Dumazet committed
      Shlomo Pongratz reported GRO L2 header check was suited for Ethernet
      only, and failed on IB/ipoib traffic.
      
      He provided a patch faking a zeroed header to let GRO aggregate frames.
      
      Roland Dreier, Herbert Xu, and others suggested we change GRO L2 header
      check to be more generic, i.e. not assuming L2 header is 14 bytes, but
      taking into account hard_header_len.
      
      __napi_gro_receive() has special handling for the common case (Ethernet)
      to avoid a memcmp() call and use an inline optimized function instead.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Tested-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 02 February, 2012 1 commit
  14. 18 January, 2012 2 commits
  15. 17 January, 2012 1 commit
  16. 02 December, 2011 1 commit
  17. 01 December, 2011 1 commit
  18. 30 November, 2011 3 commits
  19. 29 November, 2011 3 commits
    • net: dont call jump_label_dec from irq context · b90e5794
      Eric Dumazet committed
      Igor Maravic reported an error caused by jump_label_dec() being called
      from IRQ context :
      
       BUG: sleeping function called from invalid context at kernel/mutex.c:271
       in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
       1 lock held by swapper/0:
        #0:  (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
       Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
      Call Trace:
       <IRQ>  [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
       [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
       [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
       [<ffffffff8109a37f>] ? local_clock+0x6f/0x80
       [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
       [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
       [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
       [<ffffffff81566525>] net_disable_timestamp+0x15/0x20
       [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
       [<ffffffff81557b00>] __sk_free+0x80/0x200
       [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
       [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
       [<ffffffff81557cba>] sock_wfree+0x3a/0x70
       [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
       [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
       [<ffffffff8155c119>] kfree_skb+0x49/0x170
       [<ffffffff815e936e>] arp_error_report+0x3e/0x90
       [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
       [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
       [<ffffffff81578d20>] ? neigh_update+0x640/0x640
       [<ffffffff81073558>] __do_softirq+0xc8/0x3a0
      
      Since jump_label_{inc|dec} must be called from process context only,
      we must defer jump_label_dec() if net_disable_timestamp() is called
      from interrupt context.
      Reported-by: Igor Maravic <igorm@etf.rs>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
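The deferral idea in the final paragraph of this commit can be sketched in userspace as follows. This is an illustration under assumptions, not the actual patch: the `demo_` names are hypothetical, the `in_irq` parameter stands in for a real interrupt-context test, and the two counters are plain C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int demo_key_count;     /* stands in for the jump label count */
static atomic_int demo_deferred_decs; /* decrements postponed from IRQ context */

/* Process context only: apply any postponed decrements. */
static void demo_flush_deferred(void)
{
    int pending = atomic_exchange(&demo_deferred_decs, 0);

    atomic_fetch_sub(&demo_key_count, pending);
}

/* Always called from process context in this model. */
static void demo_enable_timestamp(void)
{
    demo_flush_deferred();
    atomic_fetch_add(&demo_key_count, 1);
}

/* May be called from interrupt context, signalled here by 'in_irq'. */
static void demo_disable_timestamp(bool in_irq)
{
    if (in_irq) {
        /* Must not sleep here, so only note the pending decrement. */
        atomic_fetch_add(&demo_deferred_decs, 1);
        return;
    }
    atomic_fetch_sub(&demo_key_count, 1);
}
```

The decrement requested from interrupt context is only recorded; it takes effect the next time a process-context caller flushes the pending count.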
    • net: use skb_flow_dissect() in __skb_get_rxhash() · 4504b861
      Eric Dumazet committed
      No functional changes.
      
      This uses the code we factorized in skb_flow_dissect()
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix corruption in /proc/*/net/dev_mcast · 5cac98dd
      Anton Blanchard committed
      I just hit this during my testing. Isn't there another bug lurking?
      
      BUG kmalloc-8: Redzone overwritten
      
      INFO: 0xc0000000de9dec48-0xc0000000de9dec4b. First byte 0x0 instead of 0xcc
      INFO: Allocated in .__seq_open_private+0x30/0xa0 age=0 cpu=5 pid=3896
      	.__kmalloc+0x1e0/0x2d0
      	.__seq_open_private+0x30/0xa0
      	.seq_open_net+0x60/0xe0
      	.dev_mc_seq_open+0x4c/0x70
      	.proc_reg_open+0xd8/0x260
      	.__dentry_open.clone.11+0x2b8/0x400
      	.do_last+0xf4/0x950
      	.path_openat+0xf8/0x480
      	.do_filp_open+0x48/0xc0
      	.do_sys_open+0x140/0x250
      	syscall_exit+0x0/0x40
      
      dev_mc_seq_ops uses dev_seq_start/next/stop but only allocates
      sizeof(struct seq_net_private) of private data, whereas it expects
      sizeof(struct dev_iter_state):
      
      struct dev_iter_state {
      	struct seq_net_private p;
      	unsigned int pos; /* bucket << BUCKET_SPACE + offset */
      };
      
      Create dev_seq_open_ops and use it so we don't have to expose
      struct dev_iter_state.
      
      [ Problem added by commit f04565dd (dev: use name hash for
        dev_seq_ops) -Eric ]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
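The size mismatch behind this redzone report can be shown with two small structs. The sketch below is a userspace model with hypothetical `demo_` names; it illustrates why the open helper has to allocate the larger iterator state, and is not the kernel's seq_file code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seq_net_private. */
struct demo_seq_net_private {
    void *net;
};

/* Iterator state used by the shared start/next/stop callbacks. */
struct demo_dev_iter_state {
    struct demo_seq_net_private p; /* must stay the first member */
    unsigned int pos;
};

/* Generic helper: the caller says how much private data it needs. */
static void *demo_seq_open_private(size_t psize)
{
    return calloc(1, psize);
}

/*
 * Open helper that keeps the iterator state type private to one file,
 * in the spirit of the dev_seq_open_ops mentioned above. Allocating
 * only sizeof(struct demo_seq_net_private) here would leave 'pos'
 * outside the allocation.
 */
static struct demo_dev_iter_state *demo_dev_seq_open(void)
{
    return demo_seq_open_private(sizeof(struct demo_dev_iter_state));
}
```

Callers that share the iterator callbacks go through the one open helper, so none of them can pass the wrong size.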
  20. 23 November, 2011 1 commit
    • net: add network priority cgroup infrastructure (v4) · 5bc1421e
      Neil Horman committed
      This patch adds in the infrastructure code to create the network priority
      cgroup.  The cgroup, in addition to the standard processes file creates two
      control files:
      
      1) prioidx - This is a read-only file that exports the index of this cgroup.
      This is a value that is both arbitrary and unique to a cgroup in this subsystem,
      and is used to index the per-device priority map
      
      2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
      <name:priority> where name is the name of a network interface and priority
      indicates the priority assigned to frames egressing on the named interface and
      originating from a pid in this cgroup
      
      This cgroup allows for skb priority to be set prior to a root qdisc getting
      selected. This is beneficial for DCB enabled systems, in that it allows for any
      application to use DCB configured priorities without application modification
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      CC: Robert Love <robert.w.love@intel.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
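The relationship between prioidx and the per-device priority map can be sketched in userspace like this. Everything here is a hypothetical model: the `demo_` names, the fixed table size, and the rule that an already-set skb priority is left alone are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CGROUPS 8 /* assumed fixed table size for this sketch */

/* Per-device table of priorities, indexed by a cgroup's prioidx. */
struct demo_netdev {
    const char *name;
    unsigned int priomap[DEMO_MAX_CGROUPS];
};

struct demo_skb {
    unsigned int priority;
};

/* Models setting one <name:priority> entry for the cgroup 'prioidx'. */
static int demo_priomap_set(struct demo_netdev *dev, unsigned int prioidx,
                            unsigned int prio)
{
    if (prioidx >= DEMO_MAX_CGROUPS)
        return -1;
    dev->priomap[prioidx] = prio;
    return 0;
}

/*
 * Egress path, before a qdisc is selected: stamp the priority that
 * the sending task's cgroup maps to on this device. Assumption: a
 * priority the application already set is left untouched.
 */
static void demo_update_skb_prio(struct demo_skb *skb,
                                 const struct demo_netdev *dev,
                                 unsigned int prioidx)
{
    if (skb->priority == 0 && prioidx < DEMO_MAX_CGROUPS)
        skb->priority = dev->priomap[prioidx];
}
```

The application never touches the priority itself; it inherits whatever its cgroup maps to on the egress interface.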
  21. 18 November, 2011 1 commit
    • net: use jump_label to shortcut RPS if not setup · adc9300e
      Eric Dumazet committed
      Most machines don't use RPS/RFS, and pay a fair amount of instructions in
      netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
      RPS/RFS is not set up.
      
      Add a jump_label named rps_needed.
      
      If no device rps_map or global rps_sock_flow_table is set up,
      netif_receive_skb() / netif_rx() do a single instruction instead of many
      ones, including conditional jumps.
      
      jmp +0    (if CONFIG_JUMP_LABEL=y)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
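The shortcut described in this commit amounts to guarding the expensive lookup with a cheap "is anything configured" test. The userspace sketch below models that with an ordinary counter instead of a patched jump; all `demo_` names and the CPU-selection rule are hypothetical.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4

static int demo_rps_needed;  /* number of configured maps/tables */
static int demo_rps_lookups; /* how often the slow lookup actually ran */

/* Stand-in for the expensive steering lookup. */
static int demo_get_rps_cpu(unsigned int flow_hash)
{
    demo_rps_lookups++;
    return (int)(flow_hash % DEMO_NR_CPUS);
}

/* Called when a map or flow table is installed or removed. */
static void demo_rps_configure(void)   { demo_rps_needed++; }
static void demo_rps_unconfigure(void) { demo_rps_needed--; }

/* Returns the target CPU, or -1 when the packet is handled locally. */
static int demo_netif_receive(unsigned int flow_hash)
{
    /* With jump labels this test becomes a patched jump. */
    if (demo_rps_needed > 0)
        return demo_get_rps_cpu(flow_hash);
    return -1;
}
```

While nothing is configured, the receive path never enters the lookup at all.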
  22. 17 November, 2011 4 commits
  23. 30 October, 2011 1 commit
    • vlan: allow nested vlan_do_receive() · 6a32e4f9
      Eric Dumazet committed
      commit 2425717b (net: allow vlan traffic to be received under bond)
      broke ARP processing on vlan on top of bonding.
      
             +-------+
      eth0 --| bond0 |---bond0.103
      eth1 --|       |
             +-------+
      
      52870.115435: skb_gro_reset_offset <-napi_gro_receive
      52870.115435: dev_gro_receive <-napi_gro_receive
      52870.115435: napi_skb_finish <-napi_gro_receive
      52870.115435: netif_receive_skb <-napi_skb_finish
      52870.115435: get_rps_cpu <-netif_receive_skb
      52870.115435: __netif_receive_skb <-netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: bond_handle_frame <-__netif_receive_skb
      52870.115436: vlan_do_receive <-__netif_receive_skb
      52870.115436: arp_rcv <-__netif_receive_skb
      52870.115436: kfree_skb <-arp_rcv
      
      Packet is dropped in arp_rcv() because its pkt_type was set to
      PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103
      exists.
      
      We really need to change pkt_type only if no more rx_handler is about to
      be called for the packet.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
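The rule stated at the end of this commit, change pkt_type only when no further rx_handler will run, can be modelled as below. This is a sketch with hypothetical names and a simplified signature; the real vlan_do_receive() takes different arguments.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_pkt_type { DEMO_PACKET_HOST, DEMO_PACKET_OTHERHOST };

struct demo_skb {
    int vlan_id;                 /* 0 once the tag has been consumed */
    enum demo_pkt_type pkt_type;
};

/*
 * vlan_exists:  a VLAN device for skb->vlan_id exists on the current dev.
 * last_handler: no further rx_handler will see this packet.
 * Returns true if the packet was handed to a VLAN device.
 */
static bool demo_vlan_do_receive(struct demo_skb *skb, bool vlan_exists,
                                 bool last_handler)
{
    if (!vlan_exists) {
        /* Only give up on the packet if nobody else can claim it. */
        if (last_handler)
            skb->pkt_type = DEMO_PACKET_OTHERHOST;
        return false;
    }
    skb->vlan_id = 0;
    return true;
}
```

In the bonding trace above, the first call happens on eth0 while the bonding handler is still pending, so the packet keeps its type and can be matched on bond0.103 in the second call.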
  24. 24 October, 2011 1 commit
  25. 21 October, 2011 1 commit
  26. 20 October, 2011 2 commits
    • net: validate HWTSTAMP ioctl parameters · 4dc360c5
      Richard Cochran committed
      This patch adds a sanity check on the values provided by user space for
      the hardware time stamping configuration. If the values lie outside of
      the absolute limits, then the ioctl request will be denied.
      Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
      Signed-off-by: David S. Miller <davem@davemloft.net>
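A range check of the kind this commit describes could look like the following userspace sketch. The enums, the struct layout and the choice of error codes are all assumptions made for illustration; the real limits come from the kernel's time stamping headers.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical value ranges; the real enums live in kernel headers. */
enum demo_tx_type {
    DEMO_TX_OFF,
    DEMO_TX_ON,
    DEMO_TX_TYPE_MAX = DEMO_TX_ON
};

enum demo_rx_filter {
    DEMO_FILTER_NONE,
    DEMO_FILTER_ALL,
    DEMO_FILTER_SOME,
    DEMO_FILTER_MAX = DEMO_FILTER_SOME
};

/* Hypothetical mirror of the configuration passed in from user space. */
struct demo_hwtstamp_config {
    int flags;
    int tx_type;
    int rx_filter;
};

/* Returns 0 if the request is within the absolute limits. */
static int demo_hwtstamp_validate(const struct demo_hwtstamp_config *cfg)
{
    if (cfg->flags) /* assumption: flags are reserved and must be zero */
        return -EINVAL;
    if (cfg->tx_type < 0 || cfg->tx_type > DEMO_TX_TYPE_MAX)
        return -ERANGE;
    if (cfg->rx_filter < 0 || cfg->rx_filter > DEMO_FILTER_MAX)
        return -ERANGE;
    return 0;
}
```

Doing this once in common code means individual drivers only ever see values that are at least inside the defined ranges.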
    • net: Move rcu_barrier from rollback_registered_many to netdev_run_todo. · 850a545b
      Eric W. Biederman committed
      This patch moves the rcu_barrier from rollback_registered_many
      (inside the rtnl_lock) into netdev_run_todo (just outside the rtnl_lock).
      This allows us to gain the full benefit of synchronize_net calling
      synchronize_rcu_expedited when the rtnl_lock is held.
      
      The rcu_barrier in rollback_registered_many was originally a synchronize_net
      but was promoted to be an rcu_barrier() when it was found that people were
      unnecessarily hitting the 250ms wait in netdev_wait_allrefs().  Changing
      the rcu_barrier back to a synchronize_net is therefore safe.
      
      Since we only care about waiting for the rcu callbacks before we get
      to netdev_wait_allrefs() it is also safe to move the wait into
      netdev_run_todo.
      
      This was tested by creating and destroying 1000 tap devices and observing
      /proc/lock_stat.  /proc/lock_stat reports this change reduces the hold
      times of the rtnl_lock by a factor of 10.  There was no observable
      difference in the amount of time it takes to destroy a network device.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 19 October, 2011 2 commits
    • net: add skb frag size accessors · 9e903e08
      Eric Dumazet committed
      To ease skb->truesize sanitization, it's better to be able to localize
      all references to skb frags size.
      
      Define accessors : skb_frag_size() to fetch frag size, and
      skb_frag_size_{set|add|sub}() to manipulate it.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
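The accessor pattern named in this commit can be illustrated with a stand-in fragment type. The sketch below uses hypothetical `demo_` names and a one-field struct; it shows the shape of the four helpers, not the kernel's skb_frag_t definition.

```c
#include <assert.h>

/* Hypothetical stand-in for a page fragment descriptor. */
struct demo_frag {
    unsigned int size;
};

/* Read the fragment size. */
static inline unsigned int demo_frag_size(const struct demo_frag *frag)
{
    return frag->size;
}

/* Overwrite the fragment size. */
static inline void demo_frag_size_set(struct demo_frag *frag, unsigned int size)
{
    frag->size = size;
}

/* Grow the fragment by 'delta' bytes. */
static inline void demo_frag_size_add(struct demo_frag *frag, int delta)
{
    frag->size += delta;
}

/* Shrink the fragment by 'delta' bytes. */
static inline void demo_frag_size_sub(struct demo_frag *frag, int delta)
{
    frag->size -= delta;
}
```

Once every reader and writer goes through helpers like these, the representation of the size can change in one place.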
    • net: allow vlan traffic to be received under bond · 2425717b
      John Fastabend committed
      The following configuration used to work as I expected. At least
      we could use the fcoe interfaces to do MPIO and the bond0 iface
      to do load balancing or failover.
      
             ---eth2.228-fcoe
             |
      eth2 -----|
                |
                |---- bond0
                |
      eth3 -----|
             |
             ---eth3.228-fcoe
      
      This worked because of a change we added to allow inactive slaves
      to rx 'exact' matches. This functionality was kept intact with the
      rx_handler mechanism. However now the vlan interface attached to the
      active slave never receives traffic because the bonding rx_handler
      updates the skb->dev and goto's another_round. Previously, the
      vlan_do_receive() logic was called before the bonding rx_handler.
      
      Now by the time vlan_do_receive calls vlan_find_dev() the
      skb->dev is set to bond0 and it is clear no vlan is attached
      to this iface. The vlan lookup fails.
      
      This patch moves the VLAN check above the rx_handler. A VLAN
      tagged frame is now routed to the eth2.228-fcoe iface in the
      above schematic. Untagged frames continue to the bond0 as
      normal. This case also remains intact,
      
      eth2 --> bond0 --> vlan.228
      
      Here the skb is VLAN tagged but the vlan lookup fails on eth2
      causing the bonding rx_handler to be called. On the second
      pass the vlan lookup is on the bond0 iface and completes as
      expected.
      
      Putting a VLAN.228 on both the bond0 and eth2 device will
      result in eth2.228 receiving the skb. I don't think this is
      completely unexpected and was the result prior to the rx_handler
      result.
      
      Note, the same setup is also used for other storage traffic that
      MPIO is used with, e.g. iSCSI, and similar setups can be contrived
      without storage protocols.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jesse Gross <jesse@nicira.com>
      Reviewed-by: Jiri Pirko <jpirko@redhat.com>
      Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
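The receive ordering this commit describes, VLAN lookup first and then the rx_handler with another round on the master device, can be walked through with a small userspace model. The `demo_dev` struct and `demo_receive()` are hypothetical, each device carries at most one VLAN in this sketch, and the resulting names are simplified to "<dev>.<vlan>".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, minimal model of a network device. */
struct demo_dev {
    const char *name;
    struct demo_dev *master; /* bond reached via the rx_handler, or NULL */
    int vlan_id;             /* VLAN attached to this device, 0 if none */
};

/* Writes the name of the interface that ends up receiving the frame. */
static void demo_receive(const struct demo_dev *dev, int tag,
                         char *out, size_t len)
{
    for (;;) { /* each iteration models one "another_round" pass */
        /* The VLAN check now runs before the rx_handler. */
        if (tag && dev->vlan_id == tag) {
            snprintf(out, len, "%s.%d", dev->name, tag);
            return;
        }
        if (!dev->master)
            break;
        dev = dev->master; /* rx_handler retargets the frame */
    }
    snprintf(out, len, "%s", dev->name);
}
```

With a VLAN on the slave, a tagged frame stops at the slave's VLAN device; untagged frames, and tagged frames whose VLAN exists only on the bond, continue to bond0 as before.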
  28. 04 October, 2011 1 commit
  29. 29 September, 2011 1 commit