1. 11 Jul, 2015 1 commit
    • net: do not process device backlog during unregistration · e9e4dd32
      Julian Anastasov committed
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where a processed packet comes from a device
      with a destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in the NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). The above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: the backlog can keep
      packets for a long time and they do not hold a reference to
      the device. Such packets are then delivered to upper levels
      at the same time as the device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts for all packets from the backlog, but before that some
      packets continue to be delivered to upper levels long after the
      synchronize_net call which is supposed to wait for the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to the backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use netif_running check to make sure we do not add more
      packets to backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As a result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted for.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
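The ordering described in e9e4dd32 (stop the device, flush its backlog early, and refuse new packets once it is no longer running) can be sketched with a small user-space model. This is an illustration only, not kernel code: `Device`, `Backlog`, and `unregister` are hypothetical names, and the real check runs in enqueue_to_backlog with local IRQs disabled.

```python
from collections import deque


class Device:
    """Toy stand-in for a net_device; it only tracks the running state."""

    def __init__(self, name):
        self.name = name
        self.running = True


class Backlog:
    """Toy stand-in for one per-CPU backlog queue."""

    def __init__(self):
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, dev, pkt):
        # Models the netif_running() check: once the device is stopped,
        # new packets are refused instead of being queued.
        if not dev.running:
            self.dropped += 1
            return False
        self.queue.append((dev, pkt))
        return True

    def flush(self, dev):
        # Drop every queued packet that still points at `dev`.
        kept = deque(item for item in self.queue if item[0] is not dev)
        flushed = len(self.queue) - len(kept)
        self.queue = kept
        self.dropped += flushed
        return flushed


def unregister(dev, backlog):
    dev.running = False        # the driver is stopped first
    return backlog.flush(dev)  # then the backlog is flushed early
```

In this model, no packet for a stopped device can reach the queue after `unregister` returns, which is the property the commit relies on before the synchronize_net() step.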
  2. 09 Jul, 2015 2 commits
  3. 09 Jun, 2015 1 commit
  4. 02 Jun, 2015 1 commit
  5. 22 May, 2015 1 commit
  6. 15 May, 2015 1 commit
    • net: core: set qdisc pkt len before tc_classify · 3365495c
      Florian Westphal committed
      commit d2788d34 ("net: sched: further simplify handle_ing")
      removed the call to qdisc_enqueue_root().
      
      However, after this removal we no longer set qdisc pkt length.
      This breaks traffic policing on ingress.
      
      This is the minimum fix: set qdisc pkt length before tc_classify.
      
      Only setting the length does remove support for 'stab' on ingress, but
      as Alexei pointed out:
       "Though it was allowed to add qdisc_size_table to ingress, it's useless.
        Nothing takes advantage of recomputed qdisc_pkt_len".
      
      Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
      would result in qdisc_pkt_len_init to no longer get inlined due to the
      additional 2nd call site.
      
      Ingress policing is rare and GRO doesn't really work that well with police
      on ingress, as we see packets > mtu and drop skbs that, without
      aggregation, would still have fitted the policer budget.
      Thus to have reliable/smooth ingress policing GRO has to be turned off.
      
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Fixes: d2788d34 ("net: sched: further simplify handle_ing")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
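Why a missing packet length breaks policing, and why GRO interacts badly with it, can be illustrated with a toy byte budget. This is a simplified sketch, not the kernel's police action: `Policer` and `police` are hypothetical names, there is no time-based refill, and an unset length is modelled as 0 (in practice it could be any stale value).

```python
class Policer:
    """Toy byte-budget policer with no refill over time."""

    def __init__(self, burst_bytes):
        self.tokens = burst_bytes

    def conforms(self, pkt_len):
        # A packet passes only if its recorded length fits the budget.
        if pkt_len > self.tokens:
            return False
        self.tokens -= pkt_len
        return True


def police(lengths, burst_bytes):
    """Return the pass/drop verdict for each recorded packet length."""
    policer = Policer(burst_bytes)
    return [policer.conforms(n) for n in lengths]
```

With a 3000-byte budget, `police([1500, 1500, 1500], 3000)` drops only the third packet, `police([0, 0, 0], 3000)` never drops anything, and a single aggregated 4500-byte packet is dropped outright even though two of its three segments would have fitted.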
  7. 14 May, 2015 4 commits
    • netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira committed
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, that seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
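The claim in e687ad60 that the results are very similar can be checked from the pps figures quoted in the commit. The helper name `pct_change` is an illustrative choice; the numbers are taken from the pktgen output above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Packets per second reported by pktgen in the commit message.
plain = pct_change(16086246, 16090536)    # no tc ingress filter attached
with_tc = pct_change(10788648, 10743194)  # with the u32 tc ingress filter

print(f"plain: {plain:+.3f}%  tc ingress: {with_tc:+.3f}%")
```

Both deltas are well under one percent (roughly +0.03% and -0.42%), which is consistent with the commit's statement; whether that is within run-to-run noise is not stated in the source.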
    • net: add CONFIG_NET_INGRESS to enable ingress filtering · 1cf51900
      Pablo Neira committed
      This new config switch enables the ingress filtering infrastructure that is
      controlled through the ingress_needed static key. This prepares the
      introduction of the Netfilter ingress hook that resides under this unique
      static key.
      
      Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
      problem since this also depends on CONFIG_NET_CLS_ACT.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move netdev_pick_tx and dependencies to net/core/dev.c · 638b2a69
      Jiri Pirko committed
      next to its user. No relation to flow_dissector so it makes no sense to
      have it in flow_dissector.c
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move __skb_tx_hash to dev.c · 5605c762
      Jiri Pirko committed
      __skb_tx_hash function has no relation to flow_dissect so just move it
      to dev.c
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 13 May, 2015 1 commit
    • net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue() · a2029240
      Denys Vlasenko committed
      These functions compile to 60 bytes of machine code each.
      With this .config: http://busybox.net/~vda/kernel_config
      there are 617 calls of netif_tx_stop_queue()
      and 49 calls of netif_tx_stop_all_queues() in vmlinux.
      
      To fix this, remove WARN_ON in netif_tx_stop_queue()
      as suggested by davem, and deinline netif_tx_stop_all_queues().
      
      Change in code size is about 20k:
      
         text      data      bss       dec     hex filename
      82426986 22255416 20627456 125309858 77813a2 vmlinux.before
      82406248 22255416 20627456 125289120 777c2a0 vmlinux
      
      gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue
      sometimes:
      
      $ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l
      190
      
      ffffffff81b558a8 <netif_tx_stop_queue>:
      ffffffff81b558a8:       55                      push   %rbp
      ffffffff81b558a9:       48 89 e5                mov    %rsp,%rbp
      ffffffff81b558ac:       f0 80 8f e0 01 00 00    lock orb $0x1,0x1e0(%rdi)
      ffffffff81b558b3:       01
      ffffffff81b558b4:       5d                      pop    %rbp
      ffffffff81b558b5:       c3                      retq
      
      This needs additional fixing.
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Joe Perches <joe@perches.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
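The "about 20k" figure in a2029240 follows directly from the size table quoted in the commit. The sketch below recomputes it and cross-checks the dec and hex columns; the dictionary layout is an illustrative choice, the numbers are copied from the table.

```python
# Section sizes in bytes, copied from the size table in the commit message.
before = {"text": 82426986, "data": 22255416, "bss": 20627456}
after = {"text": 82406248, "data": 22255416, "bss": 20627456}


def total(sections):
    """The 'dec' column is the sum of text, data and bss."""
    return sum(sections.values())


saved = before["text"] - after["text"]
print(f"text shrank by {saved} bytes")
print(f"dec before: {total(before)} (hex {total(before):x})")
print(f"dec after:  {total(after)} (hex {total(after):x})")
```

Only the text column changes, so the difference in text equals the difference in the dec totals: 20738 bytes.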
  9. 11 May, 2015 2 commits
    • net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann committed
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. After having commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers however, are not necessary for calling
      into ingress qdisc. pktgen calling locally into netif_receive_skb()
      with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
      E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!) and execute on that from
      handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
      ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
      is also not applicable, ingress_cl_list provides similar behaviour.
      In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: consolidate handle_ing and ing_filter · c9e99fd0
      Daniel Borkmann committed
      Given quite some code has been removed from ing_filter(), we can just
      consolidate that function into handle_ing() and get rid of a few
      instructions at the same time.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 05 May, 2015 2 commits
  11. 04 May, 2015 1 commit
  12. 27 Apr, 2015 1 commit
    • net: rfs: fix crash in get_rps_cpus() · a31196b0
      Eric Dumazet committed
      Commit 567e4b79 ("net: rfs: add hash collision detection") had one
      mistake:
      
      RPS_NO_CPU is no longer the marker for an invalid cpu in set_rps_cpu()
      and get_rps_cpu(), as @next_cpu was the result of an AND with
      rps_cpu_mask.
      
      This bug showed up on a host with 72 cpus:
      next_cpu was 0x7f, and the code was trying to access percpu data of a
      nonexistent cpu.
      
      In a follow up patch, we might get rid of compares against nr_cpu_ids,
      if we init the tables with 0. This is silly to test for a very unlikely
      condition that exists only shortly after table initialization, as
      we got rid of rps_reset_sock_flow() and similar functions that were
      writing this RPS_NO_CPU magic value at flow dismantle : When table is
      old enough, it never contains this value anymore.
      
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
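The 0x7f value reported in a31196b0 can be reproduced with a few lines of arithmetic. The sketch assumes two details not spelled out in the commit message: that RPS_NO_CPU is 0xffff, and that rps_cpu_mask is the CPU count rounded up to a power of two, minus one. The helper names are illustrative.

```python
RPS_NO_CPU = 0xFFFF  # assumed value of the kernel's "no CPU" marker


def roundup_pow_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rps_cpu_mask(nr_cpu_ids):
    # Assumed derivation of the mask from the number of possible CPUs.
    return roundup_pow_of_two(nr_cpu_ids) - 1


def masked_no_cpu(nr_cpu_ids):
    """What the marker looks like after being ANDed with the mask."""
    return RPS_NO_CPU & rps_cpu_mask(nr_cpu_ids)


def is_valid_cpu(cpu, nr_cpu_ids):
    return cpu < nr_cpu_ids
```

On a 72-CPU host the mask is 0x7f, so the masked marker becomes CPU 127: it no longer equals RPS_NO_CPU, and it is not a valid CPU index either, which matches the crash described above.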
  13. 18 Apr, 2015 1 commit
  14. 14 Apr, 2015 1 commit
    • net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Daniel Borkmann committed
      Even if we make use of classifier and actions from the egress
      path, we're going into handle_ing() executing additional code
      on a per-packet cost for ingress qdisc, just to realize that
      nothing is attached on ingress.
      
      Instead, this can just be blinded out as a no-op entirely with
      the use of a static key. On input fast-path, we already make
      use of static keys in various places, e.g. skb time stamping,
      in RPS, etc. It makes sense to not waste time when we're assured
      that no ingress qdisc is attached anywhere.
      
      Enabling/disabling of that code path is being done via two
      helpers, namely net_{inc,dec}_ingress_queue(), that are being
      invoked under RTNL mutex when an ingress qdisc is being either
      initialized or destructed.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
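The enable/disable logic behind the static key in 4577139b can be modelled as a reference count that gates the ingress call. This sketch captures only the counting behaviour: real static keys patch the instruction stream so the disabled case costs nothing, which Python cannot show. `StaticKey` and `receive` are hypothetical names; the two helper names follow the commit message.

```python
class StaticKey:
    """Reference-counted on/off switch (a model, not real jump labels)."""

    def __init__(self):
        self.count = 0

    def inc(self):
        self.count += 1

    def dec(self):
        if self.count == 0:
            raise ValueError("static key underflow")
        self.count -= 1

    def enabled(self):
        return self.count > 0


ingress_needed = StaticKey()


def net_inc_ingress_queue():
    ingress_needed.inc()   # called when an ingress qdisc is initialized


def net_dec_ingress_queue():
    ingress_needed.dec()   # called when an ingress qdisc is destroyed


def receive(pkt, handle_ing):
    # The ingress handler runs only while at least one user is registered.
    if ingress_needed.enabled():
        return handle_ing(pkt)
    return pkt
```

With no ingress qdisc attached anywhere, `receive` never calls the handler, which is the per-packet saving the commit is after.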
  15. 08 Apr, 2015 1 commit
    • netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller committed
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one that encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 07 Apr, 2015 1 commit
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org committed
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 03 Apr, 2015 3 commits
  18. 01 Apr, 2015 1 commit
  19. 30 Mar, 2015 3 commits
  20. 24 Mar, 2015 1 commit
    • net: clear skb->priority when forwarding to another netns · 08b4b8ea
      WANG Cong committed
      skb->priority can be set for two purposes:
      
      1) With respect to IP TOS field, which is computed by a mask.
      Usually used for priority qdiscs (pfifo, prio etc.), on TX
      side (we only have ingress qdisc on RX side).
      
      2) Used as a classid or flowid, works in the same way with tc
      classid. What's more, this can even override the classid
      of tc filters.
      
      For case 1), it has been respected within its netns, I don't
      see any point of keeping it for another netns, especially
      when packets will be forwarded to Rx path (no matter from TX
      path or RX path).
      
      For case 2) we care, our applications run inside a netns,
      and we classify the packets by our own filters outside,
      If some application sets this priority, it could bypass
      our filters, therefore clear it when moving out of a netns,
      it makes no sense to bypass tc filters out of its netns.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 19 Mar, 2015 2 commits
    • net: Fix high overhead of vlan sub-device teardown. · 99c4a26a
      David S. Miller committed
      When a networking device is taken down that has a non-trivial number
      of VLAN devices configured under it, we eat a full synchronize_net()
      for every such VLAN device.
      
      This is because of the call chain:
      
      	NETDEV_DOWN notifier
      	--> vlan_device_event()
      		--> dev_change_flags()
      		--> __dev_change_flags()
      		--> __dev_close()
      		--> __dev_close_many()
      		--> dev_deactivate_many()
      			--> synchronize_net()
      
      This is kind of ridiculous because we already have infrastructure for
      batching doing operation X to a list of net devices so that we only
      incur one sync.
      
      So make use of that by exporting dev_close_many() and adjusting its
      interface so that the caller can fully manage the batch list.  Use
      this in vlan_device_event() and all the overhead goes away.
      Reported-by: Salam Noureddine <noureddine@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
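The saving from batching in 99c4a26a comes down to how many times the expensive synchronization step runs. The toy model below counts those calls for the two strategies. `SyncCounter`, `close_one_by_one`, and `close_many` are hypothetical names, and devices are plain dictionaries rather than kernel objects.

```python
class SyncCounter:
    """Counts calls to a stand-in for synchronize_net()."""

    def __init__(self):
        self.calls = 0

    def synchronize_net(self):
        self.calls += 1


def close_one_by_one(devs, sync):
    # Old behaviour: every VLAN device pays for its own synchronization.
    for dev in devs:
        dev["up"] = False
        sync.synchronize_net()


def close_many(devs, sync):
    # Batched behaviour: mark all devices down, then synchronize once.
    for dev in devs:
        dev["up"] = False
    if devs:
        sync.synchronize_net()
```

For 100 VLAN devices under one parent, the first strategy synchronizes 100 times and the batched one synchronizes once, while both leave every device down.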
    • net: add support for phys_port_name · db24a904
      David Ahern committed
      Similar to port id allow netdevices to specify port names and export
      the name via sysfs. Drivers can implement the netdevice operation to
      assist udev in having sane default names for the devices using the
      rule:
      
      $ cat /etc/udev/rules.d/80-net-setup-link.rules
      SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="", NAME="$attr{phys_port_name}"
      
      Use of phys_name versus phys_id was suggested-by Jiri Pirko.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 13 Mar, 2015 1 commit
  23. 22 Feb, 2015 1 commit
  24. 15 Feb, 2015 1 commit
  25. 12 Feb, 2015 1 commit
  26. 09 Feb, 2015 1 commit
    • net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet committed
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for a non-connected flow can avoid the
      IPI through an unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
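The 32-bit split described in 567e4b79 (low bits for the CPU number, the remaining upper bits for part of the flow hash) can be sketched as follows. This is a model under assumptions: the table is a plain Python list whose length is a power of two, and `cpu_mask_for`, `record_flow`, and `lookup_cpu` are hypothetical names rather than kernel functions.

```python
MASK32 = 0xFFFFFFFF


def cpu_mask_for(nr_cpus):
    """Low-bit mask wide enough to hold any CPU number below nr_cpus."""
    return (1 << (nr_cpus - 1).bit_length()) - 1


def record_flow(table, flow_hash, cpu, cpu_mask):
    # The bucket is chosen by the low-order bits of the hash; the entry
    # keeps the upper hash bits and stores the CPU in the low bits.
    idx = flow_hash & (len(table) - 1)
    table[idx] = (flow_hash & ~cpu_mask & MASK32) | cpu


def lookup_cpu(table, flow_hash, cpu_mask):
    entry = table[flow_hash & (len(table) - 1)]
    # Any difference in the upper bits means another flow owns the bucket.
    if (entry ^ flow_hash) & ~cpu_mask & MASK32:
        return None  # the caller would fall back to plain RPS
    return entry & cpu_mask
```

With 64 possible CPUs the mask is 63, leaving 6 bits for the CPU and 26 bits for the hash, as stated in the commit. A second flow that lands in the same bucket with different upper hash bits gets `None` instead of being steered to the first flow's CPU.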
  27. 08 Feb, 2015 1 commit
  28. 05 Feb, 2015 2 commits