1. 27 December 2014, 1 commit
    • net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding · 2c26d34b
      Committed by Jay Vosburgh
      When using VXLAN tunnels and a sky2 device, I have experienced
      checksum failures of the following type:
      
      [ 4297.761899] eth0: hw csum failure
      [...]
      [ 4297.765223] Call Trace:
      [ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
      [ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
      [ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
      [ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
      [ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
      [ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360
      
      	These are reliably reproduced in a network topology of:
      
      container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch
      
      	When VXLAN encapsulated traffic is received from a similarly
      configured peer, the above warning is generated in the receive
      processing of the encapsulated packet.  Note that the warning is
      associated with the container eth0.
      
              The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
      because the packet is an encapsulated Ethernet frame, the checksum
      generated by the hardware includes the inner protocol and Ethernet
      headers.
      
      	The receive code is careful to update the skb->csum, except in
      __dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
      calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
      to skip over the Ethernet header, but does not update skb->csum when
      doing so.
      
      	This patch resolves the problem by adding a call to
      skb_postpull_rcsum to update the skb->csum after the call to
      eth_type_trans.
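
      A minimal sketch of the resulting forwarding path (hedged; the exact
      context inside __dev_forward_skb() may differ slightly):

          skb->protocol = eth_type_trans(skb, dev);        /* pulls ETH_HLEN */
          skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); /* keep CHECKSUM_COMPLETE csum in sync */
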
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 24 December 2014, 7 commits
  3. 11 December 2014, 1 commit
    • net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb · fd11a83d
      Committed by Alexander Duyck
      This change pulls the core functionality out of __netdev_alloc_skb and
      places it in a new function named __alloc_rx_skb.  The reason for doing
      this is to make these bits accessible to a new function, __napi_alloc_skb.
      In addition, __alloc_rx_skb now takes a flags value that is used to
      determine which page frag pool to allocate from.  If the SKB_ALLOC_NAPI
      flag is set then the NAPI pool is used.  The advantage of this is that we
      do not have to use local_irq_save/restore when accessing the NAPI pool
      from NAPI context.
      
      In my test setup I saw at least 11ns of savings using the napi_alloc_skb
      function versus the netdev_alloc_skb function, most of this being due to
      the fact that we didn't have to call local_irq_save/restore.
      
      The main use case for napi_alloc_skb would be for things such as copybreak
      or page fragment based receive paths where an skb is allocated after the
      data has been received instead of before.
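
      A hedged sketch of the copybreak-style receive path this enables
      (rx_ring, rx_buf and size are hypothetical driver-side names):

          /* allocate from the NAPI page frag pool; safe in NAPI (softirq)
           * context without local_irq_save/restore */
          skb = napi_alloc_skb(&rx_ring->napi, size);
          if (skb) {
              skb_copy_to_linear_data(skb, rx_buf, size);
              skb_put(skb, size);
              napi_gro_receive(&rx_ring->napi, skb);
          }
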
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 December 2014, 2 commits
    • net: avoid to call skb_queue_len again · e008f3f0
      Committed by Li RongQing
      The queue length of sd->input_pkt_queue has already been stored in qlen
      and cannot change while we hold the lock, so there is no need to call
      skb_queue_len() again.
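
      Roughly, in enqueue_to_backlog() (a sketch; locking and softirq details
      elided):

          qlen = skb_queue_len(&sd->input_pkt_queue);
          if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
              if (qlen) {   /* was: if (skb_queue_len(&sd->input_pkt_queue)) */
                  __skb_queue_tail(&sd->input_pkt_queue, skb);
                  /* ... unlock and return NET_RX_SUCCESS ... */
              }
          }
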
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: delay RTM_DELLINK notification until after ndo_uninit() · 395eea6c
      Committed by Mahesh Bandewar
      The commit 56bfa7ee ("unregister_netdevice : move RTM_DELLINK to
      until after ndo_uninit") tried to do this earlier, but in doing so
      it created a problem.  Unfortunately the delayed rtmsg_ifinfo() also
      delayed the call to fill_info().  This translated into asking the
      driver to remove its private state and then querying that private
      state, which could have catastrophic consequences.
      
      This change breaks rtmsg_ifinfo() into two parts: the first takes a
      precise snapshot of the device by calling fill_info() before
      ndo_uninit() is invoked, and the second sends the notification using
      the collected snapshot.
      
      The problem was noticed when the last link is deleted from an ipvlan
      device: the device has already freed the port, and the subsequent
      .fill_info() call then tries to read info from that freed port.
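
      A hedged sketch of the resulting two-phase flow around device teardown
      (helper names as introduced by this change; surrounding code elided):

          /* 1) snapshot while the driver's private state is still valid */
          skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U, GFP_KERNEL);

          /* 2) ndo_uninit() may now free that private state */

          /* 3) send the notification from the snapshot, not from live state */
          if (skb)
              rtmsg_ifinfo_send(skb, dev, GFP_KERNEL);
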
      
      kernel: [  255.139429] ------------[ cut here ]------------
      kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
      kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
      kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
      kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
      kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
      kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
      kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
      kernel: [  255.139520] Call Trace:
      kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
      kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
      kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
      kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
      kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
      kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
      kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
      kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
      kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
      kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
      kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
      kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
      kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
      kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
      kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
      kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
      kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
      kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
      kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
      kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
      kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
      kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
      kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
      kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
      kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
      kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
      kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
      kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 December 2014, 1 commit
  6. 22 November 2014, 2 commits
  7. 14 November 2014, 1 commit
    • net: generic dev_disable_lro() stacked device handling · fbe168ba
      Committed by Michal Kubeček
      Large receive offloading is known to cause problems if received packets
      are passed on to another host.  Therefore the kernel disables it by calling
      dev_disable_lro() whenever a network device is enslaved in a bridge or
      forwarding is enabled for it (or globally). For virtual devices we need
      to disable LRO on the underlying physical device (which is actually
      receiving the packets).
      
      Current dev_disable_lro() code handles this propagation for a vlan
      (including 802.1ad nested vlan), macvlan or a vlan on top of a macvlan.
      It doesn't handle other stacked devices and their combinations, in
      particular propagation from a bond to its slaves which often causes
      problems in virtualization setups.
      
      As we now have generic data structures describing the upper-lower device
      relationship, dev_disable_lro() can be generalized to disable LRO also
      for all lower devices (if any) once it is disabled for the device
      itself.
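
      A hedged sketch of the generalized helper, using the lower-device
      iteration mentioned above:

          void dev_disable_lro(struct net_device *dev)
          {
              struct net_device *lower_dev;
              struct list_head *iter;

              dev->wanted_features &= ~NETIF_F_LRO;
              netdev_update_features(dev);

              if (unlikely(dev->features & NETIF_F_LRO))
                  netdev_WARN(dev, "failed to disable LRO!\n");

              /* recurse into every lower (slave/underlying) device */
              netdev_for_each_lower_dev(dev, lower_dev, iter)
                  dev_disable_lro(lower_dev);
          }
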
      
      For bonding and teaming devices, it is necessary to disable LRO not only
      on current slaves at the moment when dev_disable_lro() is called but
      also on any slave (port) added later.
      
      v2: use lower device links for all devices (including vlan and macvlan)
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Veaceslav Falico <vfalico@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 11 November 2014, 1 commit
    • net: gro: add a per device gro flush timer · 3b47d303
      Committed by Eric Dumazet
      Tuning coalescing parameters on NIC can be really hard.
      
      Servers can handle both bulk and RPC-like traffic, with conflicting
      goals: bulk flows want as big GRO packets as possible, RPC flows want
      minimal latencies.
      
      To reach big GRO packets on a 10GbE NIC, one can use:
      
      ethtool -C eth0 rx-usecs 4 rx-frames 44
      
      But this penalizes RPC sessions, increasing latencies by up to 50% in
      some cases, as NICs generally do not force an interrupt when a packet
      with the TCP PSH flag is received.
      
      Some NICs do not have an absolute timer, only a timer rearmed for every
      incoming packet.
      
      This patch uses a different strategy: let the GRO stack decide what to
      do, based on the traffic pattern.
      
      Packets with the PSH flag won't be delayed.
      Packets without the PSH flag might be held in the GRO engine, as long
      as we keep receiving data.
      
      This new mechanism is off by default, and shall be enabled by setting
      /sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
      
      To fully enable this mechanism, drivers should use napi_complete_done()
      instead of napi_complete().
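
      A hedged driver-side sketch of that conversion (mydrv_* names are
      hypothetical):

          static int mydrv_poll(struct napi_struct *napi, int budget)
          {
              int work_done = mydrv_clean_rx(napi, budget);  /* hypothetical RX cleanup */

              /* reporting work_done lets the stack arm gro_flush_timeout and
               * flush held GRO packets later instead of immediately */
              if (work_done < budget)
                  napi_complete_done(napi, work_done);

              return work_done;
          }
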
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACKs per second.
      
      The GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet).
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce the number of ACK packets (811/19.2 = 42).
      
      The receiver performs fewer calls into the upper stacks and fewer
      wakeups.  This also reduces cpu usage on the sender, as it receives
      fewer ACK packets.
      
      Note that reducing the number of wakeups increases cpu efficiency, but
      can decrease QPS, as applications won't have the chance to warm up cpu
      caches by doing a partial read of RPC requests/answers if they fit in
      one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50
      
      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 06 November 2014, 1 commit
  10. 04 November 2014, 2 commits
    • netdev, sched/wait: Fix sleeping inside wait event · ff960a73
      Committed by Peter Zijlstra
      rtnl_lock_unregistering*() takes rtnl_lock() -- a mutex -- inside a
      wait loop.  The wait loop relies on current->state to function, but so
      does mutex_lock(); nesting them lets the inner one destroy the state
      set up by the outer one.
      
      Fix this using the new wait_woken() bits.
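
      A hedged, condensed sketch of the wait_woken() loop (modeled on
      rtnl_lock_unregistering(); the condition check is abbreviated):

          DEFINE_WAIT_FUNC(wait, woken_wake_function);

          add_wait_queue(&netdev_unregistering_wq, &wait);
          for (;;) {
              rtnl_lock();        /* mutex_lock() no longer clobbers the wait state */
              if (!devices_still_unregistering())  /* hypothetical helper */
                  break;          /* leave with rtnl held */
              __rtnl_unlock();
              wait_woken(&wait, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
          }
          remove_wait_queue(&netdev_unregistering_wq, &wait);
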
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jerry Chu <hkchu@google.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: sfeldma@cumulusnetworks.com <sfeldma@cumulusnetworks.com>
      Cc: stephen hemminger <stephen@networkplumber.org>
      Cc: Tom Gundersen <teg@jklm.no>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Vlad Yasevich <vyasevic@redhat.com>
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • net: less interrupt masking in NAPI · d75b1ade
      Committed by Eric Dumazet
      net_rx_action() can mask irqs a single time, for a very short
      duration, to transfer sd->poll_list into a private list.
      
      Then, napi_complete() can avoid masking irqs again, and
      net_rx_action() only needs to mask irqs again in the slow path.
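
      A hedged sketch of the reworked fast path in net_rx_action():

          struct softnet_data *sd = this_cpu_ptr(&softnet_data);
          LIST_HEAD(list);
          LIST_HEAD(repoll);

          local_irq_disable();
          list_splice_init(&sd->poll_list, &list);  /* the single irq-off section */
          local_irq_enable();

          while (!list_empty(&list)) {
              struct napi_struct *n;

              n = list_first_entry(&list, struct napi_struct, poll_list);
              budget -= napi_poll(n, &repoll);  /* napi_complete() masks nothing here */
              /* ... budget/time checks; &repoll spliced back under irq-off later ... */
          }
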
      
      This patch removes two pairs of irq mask/unmask operations per typical
      NAPI run, more if multiple napi instances were triggered.
      
      Note this also allows giving control back to the caller (do_softirq())
      more often, so that other softirq handlers can be called a bit earlier,
      or ksoftirqd can be woken up earlier under pressure.
      
      This was developed while testing an alternative to RX interrupt
      mitigation to reduce latencies while keeping or improving GRO
      aggregation on fast NICs.
      
      The idea is to test napi->gro_list at the end of a napi->poll() and
      reschedule one NAPI poll, but only after servicing a full round of
      softirqs (timers, TX, rcu, ...).  This would be allowed only if the
      softirq is currently serviced by the idle task or ksoftirqd, and no
      reschedule is needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 30 October 2014, 1 commit
  12. 27 October 2014, 1 commit
  13. 16 October 2014, 1 commit
    • net: Add ndo_gso_check · 04ffcb25
      Committed by Tom Herbert
      Add ndo_gso_check, which a device can define to indicate whether it is
      capable of doing GSO on a packet. This function would be called from
      the stack to determine whether software GSO needs to be done instead. A
      driver should populate this function if it advertises GSO types for
      which there are combinations that it wouldn't be able to handle. For
      instance, a device that performs UDP tunneling might only support
      transparent Ethernet bridging for inner packets, or might have
      limitations on the lengths of inner headers.
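
      A hedged example of such a hook (mydrv_* names are hypothetical; the
      80-byte inner-header limit is purely illustrative):

          static bool mydrv_gso_check(struct sk_buff *skb, struct net_device *dev)
          {
              /* non-encapsulated GSO is always fine for this device */
              if (!skb->encapsulation)
                  return true;

              /* only handle tunnels whose inner headers start within 80 bytes
               * of the outer transport header */
              return skb_inner_mac_header(skb) - skb_transport_header(skb) <= 80;
          }

          static const struct net_device_ops mydrv_netdev_ops = {
              .ndo_start_xmit = mydrv_start_xmit,  /* hypothetical */
              .ndo_gso_check  = mydrv_gso_check,
          };
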
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 October 2014, 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping the dst in validate_xmit_skb() is certainly too late in case
      the packet was queued by cpu X but dequeued by cpu Y.
      
      The logical point to take care of the drop/force is in
      __dev_queue_xmit(), before even taking the qdisc lock.
      
      As Julian Anastasov pointed out, the need for skb_dst() might come from
      some packet schedulers or classifiers.
      
      This patch adds a new helper to cleanly express the needs of various
      drivers and qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call the
      following helper in their setup code instead of the prior:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one of which is
      eventually rebuilt in the bonding/team drivers.
      
      The other one is permanent and prevents IFF_XMIT_DST_RELEASE from being
      rebuilt in bonding/team.  Eventually, we could add something smarter
      later.
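
      A hedged sketch of the helper built from the two bits described above
      (flag names as I understand them from this change):

          static inline void netif_keep_dst(struct net_device *dev)
          {
              /* clear both the rebuildable and the permanent "release dst" bits */
              dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
          }
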
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 07 October 2014, 2 commits
  16. 06 October 2014, 1 commit
  17. 04 October 2014, 1 commit
    • qdisc: validate skb without holding lock · 55a93b3e
      Committed by Eric Dumazet
      Validation of an skb can be pretty expensive:
      
      GSO segmentation and/or checksum computations.
      
      We can do this without holding the qdisc lock, so that other cpus can
      queue additional packets.
      
      The trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() either validates a fresh skb list
      or directly uses an old one.
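
      A hedged, simplified sketch of the resulting sch_direct_xmit() flow
      (error and requeue paths omitted):

          /* dequeue_skb() reports whether the skb(s) still need validation */
          skb = dequeue_skb(q, &validate);

          spin_unlock(root_lock);               /* other cpus may enqueue meanwhile */
          if (validate)
              skb = validate_xmit_skb_list(skb, dev);  /* GSO segmentation, csum */
          if (skb) {
              HARD_TX_LOCK(dev, txq, smp_processor_id());
              if (!netif_xmit_frozen_or_stopped(txq))
                  skb = dev_hard_start_xmit(skb, dev, txq, &ret);
              HARD_TX_UNLOCK(dev, txq);
          }
          spin_lock(root_lock);                 /* re-take the qdisc lock afterwards */
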
      
      Tested on a 40Gb NIC (8 TX queues) with 200 concurrent flows on a
      48-thread host.
      
      Turning TSO on or off had no effect on throughput, only a few more cpu
      cycles.  Lock contention on the qdisc lock disappeared.
      
      The same holds when disabling TX checksum offload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 28 September 2014, 1 commit
  19. 27 September 2014, 1 commit
  20. 26 September 2014, 1 commit
  21. 23 September 2014, 1 commit
  22. 16 September 2014, 1 commit
  23. 14 September 2014, 3 commits
  24. 04 September 2014, 1 commit
    • qdisc: validate frames going through the direct_xmit path · 1f59533f
      Committed by Jesper Dangaard Brouer
      In commit 50cbe9ab ("net: Validate xmit SKBs right when we
      pull them out of the qdisc") the validation code was moved out of
      dev_hard_start_xmit and into dequeue_skb.
      
      However this overlooked the fact that we do not always enqueue
      the skb onto a qdisc.  The first situation is when the qdisc has the
      TCQ_F_CAN_BYPASS flag set and is empty.  The second situation is when
      there is no qdisc on the device, which is a common case for software
      devices.
      
      Originally spotted, and the initial patch written, by Alexander Duyck.
      Alex was seeing issues trying to connect to a vhost_net interface
      after commit 50cbe9ab was applied.
      
      Added a call to validate_xmit_skb() in __dev_xmit_skb(), in the
      code path for qdiscs with the TCQ_F_CAN_BYPASS flag, and in
      __dev_queue_xmit() when there is no qdisc.
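
      A hedged sketch of the TCQ_F_CAN_BYPASS branch in __dev_xmit_skb()
      after this fix (simplified; contention handling omitted):

          if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
              qdisc_run_begin(q)) {
              qdisc_bstats_update(q, skb);

              skb = validate_xmit_skb(skb, dev);  /* added: GSO/csum before direct xmit */
              if (skb && sch_direct_xmit(skb, q, dev, txq, root_lock))
                  __qdisc_run(q);

              qdisc_run_end(q);
              rc = NET_XMIT_SUCCESS;
          }
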
      
      Also handle the error situation where dev_hard_start_xmit() returns an
      skb list without returning dev_xmit_complete(rc) and falls through to
      kfree_skb(); in that situation it should call kfree_skb_list().
      
      Fixes: 50cbe9ab ("net: Validate xmit SKBs right when we pull them out of the qdisc")
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 02 September 2014, 4 commits