1. 11 May 2015, 2 commits
    • net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann authored
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. After commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers, however, are not necessary for calling
      into the ingress qdisc. With pktgen calling locally into
      netif_receive_skb() with a dummy u32, the single-CPU result on a
      Supermicro X10SLM-F, Xeon E3-1240: before ~21.1 Mpps, after the
      patch ~22.9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!) and execute on that from
      handle_ing() side. The __QDISC_STATE_DEACTIVATED test can be removed:
      the ingress qdisc doesn't have a queue, so dev_deactivate_queue()
      is not applicable either; ingress_cl_list provides similar behaviour.
      In other words, the ingress qdisc acts like a TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
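      A minimal sketch of the direction described above (not the actual
      diff; the signature is simplified): the classifier list hangs off
      the netdevice itself and handle_ing() runs tc_classify() on it
      directly, with no qdisc enqueue indirection in between.

        static struct sk_buff *handle_ing(struct sk_buff *skb)
        {
            struct tcf_proto *cl = rcu_dereference_bh(skb->dev->ingress_cl_list);
            struct tcf_result res;

            if (!cl)
                return skb;    /* nothing attached on ingress */

            /* RCU read lock is already held by __netif_receive_skb_core() */
            switch (tc_classify(skb, cl, &res)) {
            case TC_ACT_SHOT:
            case TC_ACT_STOLEN:
                kfree_skb(skb);
                return NULL;   /* packet consumed by classifier/action */
            default:
                return skb;
            }
        }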
    • net: sched: consolidate handle_ing and ing_filter · c9e99fd0
      Daniel Borkmann authored
      Given quite some code has been removed from ing_filter(), we can just
      consolidate that function into handle_ing() and get rid of a few
      instructions at the same time.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 May 2015, 1 commit
  3. 04 May 2015, 1 commit
  4. 27 Apr 2015, 1 commit
    • net: rfs: fix crash in get_rps_cpus() · a31196b0
      Eric Dumazet authored
      Commit 567e4b79 ("net: rfs: add hash collision detection") had one
      mistake:
      
      RPS_NO_CPU is no longer the marker for an invalid cpu in set_rps_cpu()
      and get_rps_cpu(), as @next_cpu is the result of an AND with
      rps_cpu_mask.
      
      This bug showed up on a host with 72 cpus: next_cpu was 0x7f, and the
      code was trying to access the per-cpu data of a non-existent cpu.
      
      In a follow-up patch, we might get rid of the compares against
      nr_cpu_ids if we init the tables with 0. It is silly to test for a
      very unlikely condition that exists only shortly after table
      initialization: since we got rid of rps_reset_sock_flow() and similar
      functions that were writing this RPS_NO_CPU magic value at flow
      dismantle, a table that is old enough never contains this value anymore.
      
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
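      The nature of the fix can be sketched as follows (the helper name is
      mine, not from the patch): with the per-flow entry now encoding
      "upper hash bits | cpu", an unset or collided entry can no longer be
      recognized by RPS_NO_CPU, so validity is checked against nr_cpu_ids
      instead.

        /* hypothetical helper illustrating the corrected validity test */
        static bool rps_flow_cpu_valid(u32 ident, u32 rps_cpu_mask)
        {
            u32 next_cpu = ident & rps_cpu_mask;

            /* e.g. on a 72-cpu host rps_cpu_mask is 0x7f, so next_cpu
             * can be 0x7f here, which is not an existing cpu
             */
            return next_cpu < nr_cpu_ids && cpu_online(next_cpu);
        }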
  5. 18 Apr 2015, 1 commit
  6. 14 Apr 2015, 1 commit
    • net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Daniel Borkmann authored
      Even if we make use of classifiers and actions from the egress
      path, we're going into handle_ing(), executing additional code at
      a per-packet cost for the ingress qdisc, just to realize that
      nothing is attached on ingress.
      
      Instead, this can just be blinded out as a no-op entirely with
      the use of a static key. On the input fast path, we already make
      use of static keys in various places, e.g. skb time stamping,
      RPS, etc. It makes sense not to waste time when we're assured
      that no ingress qdisc is attached anywhere.
      
      Enabling/disabling of that code path is done via two helpers,
      namely net_{inc,dec}_ingress_queue(), that are invoked under the
      RTNL mutex when an ingress qdisc is either initialized or
      destroyed.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
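      A minimal sketch of the static-key pattern described above (the
      helper names follow the commit text; the surrounding fast-path
      function is illustrative only):

        static struct static_key ingress_needed __read_mostly;

        void net_inc_ingress_queue(void)
        {
            static_key_slow_inc(&ingress_needed);   /* under RTNL */
        }

        void net_dec_ingress_queue(void)
        {
            static_key_slow_dec(&ingress_needed);   /* under RTNL */
        }

        /* receive fast path: the branch is patched to a no-op while no
         * ingress qdisc exists anywhere in the system
         */
        static struct sk_buff *rx_ingress_hook(struct sk_buff *skb)
        {
            if (static_key_false(&ingress_needed))
                skb = handle_ing(skb);
            return skb;
        }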
  7. 08 Apr 2015, 1 commit
    • netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller authored
      On the output paths in particular, we sometimes have to deal with two
      socket contexts.  The first, usually skb->sk, is the local socket that
      generated the frame.
      
      The second is, potentially, the socket used to control a tunneling
      socket, such as one that encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: David S. Miller <davem@davemloft.net>
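      A hedged sketch of the interface direction described above (the
      typedef and the example function are mine, not the patch's code):
      the continuation invoked after the netfilter hooks receives the
      controlling socket explicitly instead of inferring it from skb->sk.

        typedef int (*okfn_t)(struct sock *sk, struct sk_buff *skb);

        static int example_output_finish(struct sock *sk, struct sk_buff *skb)
        {
            /* sk can differ from skb->sk, e.g. the UDP socket of a vxlan
             * tunnel while skb->sk still points at the AF_PACKET socket
             * that generated the frame
             */
            return dev_queue_xmit(skb);
        }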
  8. 07 Apr 2015, 1 commit
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org authored
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. the tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_pinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
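      A hedged sketch of the rule above (the helper name is mine; it
      assumes an accessor such as dev_recursion_level() reading the
      per-cpu xmit_recursion counter): only consult skb->sk for
      per-socket output settings at recursion level 0, never from inside
      a tunnel device's transmit path.

        static inline bool skb_sk_usable_for_output(const struct sk_buff *skb)
        {
            return skb->sk && dev_recursion_level() == 0;
        }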
  9. 03 Apr 2015, 3 commits
  10. 01 Apr 2015, 1 commit
  11. 30 Mar 2015, 3 commits
  12. 24 Mar 2015, 1 commit
    • net: clear skb->priority when forwarding to another netns · 08b4b8ea
      WANG Cong authored
      skb->priority can be set for two purposes:
      
      1) With respect to the IP TOS field, it is computed by a mask.
      Usually it is used for priority qdiscs (pfifo, prio etc.) on the TX
      side (we only have the ingress qdisc on the RX side).
      
      2) Used as a classid or flowid; this works in the same way as a tc
      classid. What's more, it can even override the classid of tc
      filters.
      
      For case 1), it has been respected within its netns; I don't see
      any point in keeping it for another netns, especially when packets
      will be forwarded to the RX path (no matter whether from the TX
      path or the RX path).
      
      For case 2) we do care: our applications run inside a netns, and we
      classify the packets with our own filters outside. If some
      application sets this priority, it could bypass our filters;
      therefore clear it when moving out of a netns, as it makes no sense
      to bypass tc filters outside of its netns.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
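      The behaviour can be sketched as follows (placement and function
      name are illustrative, not the literal diff): when an skb crosses a
      netns boundary it is scrubbed, and the sender's priority is dropped
      so it cannot act as a classid/flowid for tc filters in the
      receiving netns.

        static void scrub_on_netns_cross(struct sk_buff *skb, bool xnet)
        {
            if (xnet) {
                skb_orphan(skb);
                skb->mark = 0;
                skb->priority = 0;  /* the clearing described above */
            }
        }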
  13. 19 Mar 2015, 2 commits
    • net: Fix high overhead of vlan sub-device teardown. · 99c4a26a
      David S. Miller authored
      When a networking device is taken down that has a non-trivial number
      of VLAN devices configured under it, we eat a full synchronize_net()
      for every such VLAN device.
      
      This is because of the call chain:
      
      	NETDEV_DOWN notifier
      	--> vlan_device_event()
      		--> dev_change_flags()
      		--> __dev_change_flags()
      		--> __dev_close()
      		--> __dev_close_many()
      		--> dev_deactivate_many()
      			--> synchronize_net()
      
      This is kind of ridiculous because we already have infrastructure
      for batching an operation across a list of net devices so that we
      only incur one sync.
      
      So make use of that by exporting dev_close_many() and adjusting its
      interface so that the caller can fully manage the batch list.  Use
      this in vlan_device_event() and all the overhead goes away.
      Reported-by: Salam Noureddine <noureddine@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
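      An illustrative sketch of the batching pattern (not the
      vlan_device_event() code itself, and assuming the two-argument
      dev_close_many(list, unlink) form this patch exports): queue all
      affected devices on one close_list so the synchronize_net() cost is
      paid a single time for the whole batch.

        static void close_devices_batched(struct net_device **devs, int n)
        {
            LIST_HEAD(close_list);
            int i;

            for (i = 0; i < n; i++)
                list_add_tail(&devs[i]->close_list, &close_list);

            dev_close_many(&close_list, false);  /* caller manages the list */
        }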
    • net: add support for phys_port_name · db24a904
      David Ahern authored
      Similar to port id, allow netdevices to specify port names and export
      the name via sysfs. Drivers can implement the netdevice operation to
      assist udev in having sane default names for the devices using the
      rule:
      
      $ cat /etc/udev/rules.d/80-net-setup-link.rules
      SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="",
      NAME="$attr{phys_port_name}"
      
      Use of phys_name versus phys_id was suggested by Jiri Pirko.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
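      A hypothetical driver-side sketch (the driver structure and port
      naming scheme are invented for illustration): report a stable
      front-panel name through the new ndo so udev can use
      ATTR{phys_port_name} as in the rule shown above.

        static int example_get_phys_port_name(struct net_device *dev,
                                              char *name, size_t len)
        {
            struct example_port *port = netdev_priv(dev);  /* hypothetical priv */

            if (snprintf(name, len, "p%u", port->port_number) >= len)
                return -EINVAL;
            return 0;
        }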
  14. 13 Mar 2015, 1 commit
  15. 22 Feb 2015, 1 commit
  16. 15 Feb 2015, 1 commit
  17. 12 Feb 2015, 1 commit
  18. 09 Feb 2015, 1 commit
    • net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet authored
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host and the flow hash table is populated.
      
      Also, clearing the flow in inet_release() makes RFS not very good
      for short-lived flows, as many packets can follow close()
      (FIN, ACK packets, ...).
      
      This patch extends the information stored in the global hash table
      to include not only the cpu number, but also the upper part of the
      hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For hosts with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection uses the low-order bits of the hash, we
      have a full hash match if /proc/sys/net/core/rps_sock_flow_entries
      is big enough.
      
      If the hash found in the flow table does not match, we fall back to
      RPS (if it is enabled for the rxqueue).
      
      This means that a packet for a non-connected flow can avoid the
      IPI through an unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short-lived flow performance.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
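      A worked sketch of the encoding described above (the helper names
      are mine): a single u32 per flow stores the current cpu in the low
      bits and the upper part of the flow hash in the remaining bits.

        static u32 rps_flow_encode(u32 hash, u32 cpu, u32 rps_cpu_mask)
        {
            return (hash & ~rps_cpu_mask) | (cpu & rps_cpu_mask);
        }

        static bool rps_flow_hash_matches(u32 entry, u32 hash, u32 rps_cpu_mask)
        {
            /* compare only the stored upper hash bits; on a mismatch the
             * caller falls back to plain RPS instead of trusting a
             * collided entry
             */
            return ((entry ^ hash) & ~rps_cpu_mask) == 0;
        }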
  19. 08 Feb 2015, 1 commit
  20. 05 Feb 2015, 2 commits
  21. 31 Jan 2015, 1 commit
  22. 30 Jan 2015, 1 commit
  23. 16 Jan 2015, 1 commit
    • net: rps: fix cpu unplug · ac64da0b
      Eric Dumazet authored
      softnet_data.input_pkt_queue is protected by a spinlock that
      we must hold when transferring packets from the victim queue to an
      active one. This is because other cpus could still be trying to
      enqueue packets into the victim queue.
      
      A second problem is that when we transfer the NAPI poll_list from
      the victim to the current cpu, we absolutely need to special-case
      the percpu backlog, because we do not want to add complex locking to
      protect process_queue: only the owner cpu is allowed to manipulate
      it, unless the cpu is offline.
      
      Based on initial patch from Prasad Sodagudi & Subash Abhinov
      Kasiviswanathan.
      
      This version is better because we do not slow down packet processing,
      only make migration safer.
      Reported-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
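      An illustrative sketch of the locking rule described above (not the
      literal hotplug-callback diff): the dead cpu's input_pkt_queue is
      drained with the locked skb_dequeue(), since other cpus may still
      be enqueueing into it, while process_queue can use the unlocked
      variant once its owner cpu is offline.

        static void drain_dead_cpu_backlog(struct softnet_data *oldsd)
        {
            struct sk_buff *skb;

            while ((skb = __skb_dequeue(&oldsd->process_queue)))
                netif_rx_ni(skb);   /* owner cpu is offline, no lock needed */

            while ((skb = skb_dequeue(&oldsd->input_pkt_queue)))
                netif_rx_ni(skb);   /* locked dequeue: remote enqueuers */
        }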
  24. 14 Jan 2015, 1 commit
  25. 13 Jan 2015, 1 commit
    • net: allow large number of rx queues · 10595902
      Pankaj Gupta authored
      netif_alloc_rx_queues() uses kcalloc() to allocate memory
      for the "struct netdev_rx_queue *_rx" array.
      If we are doing a large rx queue allocation, kcalloc() might
      fail, so this patch adds a fallback to vzalloc().
      A similar implementation is done for tx queue allocation in
      netif_alloc_netdev_queues().
      
      We avoid failure of high-order memory allocations
      with the help of vzalloc(); this allows us to do large
      rx and tx queue allocations, which in turn helps us to
      increase the number of queues in tun.
      
      As vmalloc() adds overhead on a critical network path,
      the __GFP_REPEAT flag is used with kzalloc() so the fallback
      happens only when really needed.
      Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: David Gibson <dgibson@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
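      The fallback pattern reads roughly as follows (the wrapper name is
      mine): try a physically contiguous kzalloc() first, without warning
      and without retrying too hard, then fall back to vzalloc() for
      large rx/tx queue arrays.

        static void *alloc_queue_array(size_t sz)
        {
            void *p = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);

            if (!p)
                p = vzalloc(sz);    /* only reached when kzalloc() fails */
            return p;
        }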
  26. 27 Dec 2014, 2 commits
    • net: Generalize ndo_gso_check to ndo_features_check · 5f35227e
      Jesse Gross authored
      GSO isn't the only offload feature with restrictions that
      potentially can't be expressed with the current features mechanism.
      Checksum is another, although it's a general issue that could in
      theory apply to anything. Even if it may be possible to
      implement these restrictions in other ways, it can result in
      duplicate code or inefficient per-packet behavior.
      
      This generalizes ndo_gso_check so that drivers can remove any
      features that don't make sense for a given packet, similar to
      netif_skb_features(). It also converts existing driver
      restrictions to the new format, completing the work that was
      done to support tunnel protocols since the issues apply to
      checksums as well.
      
      By actually removing features from the set that are used to do
      offloading, it solves another problem with the existing
      interface. In these cases, GSO would run with the original set
      of features and not do anything because it appears that
      segmentation is not required.
      
      CC: Tom Herbert <therbert@google.com>
      CC: Joe Stringer <joestringer@nicira.com>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Fixes: 04ffcb25 ("net: Add ndo_gso_check")
      Tested-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
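      A hypothetical driver-side sketch of the generalized ndo (the
      driver and its policy are invented for illustration): a driver can
      strip any offload feature that does not apply to this particular
      skb, e.g. drop checksum and GSO offloads for encapsulated packets
      it cannot parse.

        static netdev_features_t example_features_check(struct sk_buff *skb,
                                                        struct net_device *dev,
                                                        netdev_features_t features)
        {
            if (skb->encapsulation)
                features &= ~(NETIF_F_ALL_CSUM | NETIF_F_GSO_MASK);
            return features;
        }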
    • net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding · 2c26d34b
      Jay Vosburgh authored
      When using VXLAN tunnels and a sky2 device, I have experienced
      checksum failures of the following type:
      
      [ 4297.761899] eth0: hw csum failure
      [...]
      [ 4297.765223] Call Trace:
      [ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
      [ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
      [ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
      [ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
      [ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
      [ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360
      
      	These are reliably reproduced in a network topology of:
      
      container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch
      
      	When VXLAN encapsulated traffic is received from a similarly
      configured peer, the above warning is generated in the receive
      processing of the encapsulated packet.  Note that the warning is
      associated with the container eth0.
      
              The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
      because the packet is an encapsulated Ethernet frame, the checksum
      generated by the hardware includes the inner protocol and Ethernet
      headers.
      
      	The receive code is careful to update the skb->csum, except in
      __dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
      calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
      to skip over the Ethernet header, but does not update skb->csum when
      doing so.
      
      	This patch resolves the problem by adding a call to
      skb_postpull_rcsum to update the skb->csum after the call to
      eth_type_trans.
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
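      The fix can be sketched as follows (simplified from the forwarding
      path; the wrapper function is mine): after eth_type_trans() pulls
      the Ethernet header, a CHECKSUM_COMPLETE value must be adjusted for
      the pulled bytes, otherwise the inner checksum validation fails
      later.

        static void forward_set_protocol(struct sk_buff *skb, struct net_device *dev)
        {
            skb->protocol = eth_type_trans(skb, dev);       /* pulls ETH_HLEN */
            skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
        }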
  27. 24 Dec 2014, 6 commits