1. 17 11月, 2015 2 次提交
  2. 05 11月, 2015 1 次提交
    • J
      net/core: ensure features get disabled on new lower devs · e7868a85
      Jarod Wilson 提交于
      With moving netdev_sync_lower_features() after the .ndo_set_features
      calls, I neglected to verify that devices added *after* a flag had been
      disabled on an upper device were properly added with that flag disabled as
      well. This currently happens, because we exit __netdev_update_features()
      when we see dev->features == features for the upper dev. We can retain the
      optimization of leaving without calling .ndo_set_features with a bit of
      tweaking and a goto here.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Reported-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7868a85
  3. 04 11月, 2015 1 次提交
    • J
      net/core: fix for_each_netdev_feature · 5ba3f7d6
      Jarod Wilson 提交于
      As pointed out by Nikolay and further explained by Geert, the initial
      for_each_netdev_feature macro was broken, as feature would get set outside
      of the block of code it was intended to run in, thus only ever working for
      the first feature bit in the mask. While less pretty this way, this is
      tested and confirmed functional with multiple feature bits set in
      NETIF_F_UPPER_DISABLES.
      
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      ...
      [  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      [root@dell-per730-01 ~]# ethtool -K bond0 gro off
      ...
      [  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
      [  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
      [  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Geert Uytterhoeven <geert@linux-m68k.org>
      CC: netdev@vger.kernel.org
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ba3f7d6
  4. 03 11月, 2015 1 次提交
    • J
      net/core: generic support for disabling netdev features down stack · fd867d51
      Jarod Wilson 提交于
      There are some netdev features, which when disabled on an upper device,
      such as a bonding master or a bridge, must be disabled and cannot be
      re-enabled on underlying devices.
      
      This is a rework of an earlier more heavy-handed appraoch, which simply
      disables and prevents re-enabling of netdev features listed in a new
      define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. Any upper
      device that disables a flag in that feature mask, the disabling will
      propagate down the stack, and any lower device that has any upper device
      with one of those flags disabled should not be able to enable said flag.
      
      Initially, only LRO is included for proof of concept, and because this
      code effectively does the same thing as dev_disable_lro(), though it will
      also activate from the ethtool path, which was one of the goals here.
      
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: off
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: off
      
      dmesg dump:
      
      [ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      This has been successfully tested with bnx2x, qlcnic and netxen network
      cards as slaves in a bond interface. Turning LRO on or off on the master
      also turns it on or off on each of the slaves, new slaves are added with
      LRO in the same state as the master, and LRO can't be toggled on the
      slaves.
      
      Also, this should largely remove the need for dev_disable_lro(), and most,
      if not all, of its call sites can be replaced by simply making sure
      NETIF_F_LRO isn't included in the relevant device's feature flags.
      
      Note that this patch is driven by bug reports from users saying it was
      confusing that bonds and slaves had different settings for the same
      features, and while it won't be 100% in sync if a lower device doesn't
      support a feature like LRO, I think this is a good step in the right
      direction.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd867d51
  5. 23 10月, 2015 1 次提交
    • P
      openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar 提交于
      While transitioning to netdev based vport we broke OVS
      feature which allows user to retrieve tunnel packet egress
      information for lwtunnel devices.  Following patch fixes it
      by introducing ndo operation to get the tunnel egress info.
      Same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices. So after adding such device operation
      we can remove similar operation from ovs-vport.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc4099f1
  6. 16 10月, 2015 1 次提交
  7. 05 10月, 2015 1 次提交
  8. 26 9月, 2015 1 次提交
  9. 24 9月, 2015 1 次提交
    • N
      netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Neil Horman 提交于
      Drivers might call napi_disable while not holding the napi instance poll_lock.
      In those instances, its possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, its possible
      for the adapter to think it has exclusive access to the hardware  (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
      Additionaly, there is another, more critical race between netpoll and
      napi_disable.  The disabled napi state is actually identical to the scheduled
      state for a given napi instance.  The implication being that, if a napi instance
      is disabled, a netconsole instance would see the napi state of the device as
      having been scheduled, and poll it, likely while the driver was dong something
      requiring exclusive access.  In the case above, its fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
      The fix should be pretty easy.  netpoll uses its own bit to indicate that that
      the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
      We can just gate disabling on that bit as well as the sched bit.  That should
      prevent netpoll from conducting a napi poll if we convert its set bit to a
      test_and_set_bit operation to provide mutual exclusion
      
      Change notes:
      V2)
      	Remove a trailing whtiespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d8bff12
  10. 18 9月, 2015 4 次提交
    • A
      bpf: add bpf_redirect() helper · 27b29f63
      Alexei Starovoitov 提交于
      Existing bpf_clone_redirect() helper clones skb before redirecting
      it to RX or TX of destination netdev.
      Introduce bpf_redirect() helper that does that without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1
      The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
      that does bpf_clone_redirect() and performance is 2.0 Mpps
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison linux bridge in this setup is doing 2.1 Mpps
      and ixgbe rx + drop in ip_rcv - 7.8 Mpps
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27b29f63
    • E
      netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman 提交于
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess about the best expression for the network namespace to process
      packets in.
      
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done it's job passing in the desired network namespace is in
      many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4b51f0
    • E
      bridge: Add br_netif_receive_skb remove netif_receive_skb_sk · 04eb4489
      Eric W. Biederman 提交于
      netif_receive_skb_sk is only called once in the bridge code, replace
      it with a bridge specific function that calls netif_receive_skb.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04eb4489
    • E
      net: Remove dev_queue_xmit_sk · 2b4aa3ce
      Eric W. Biederman 提交于
      A function with weird arguments that it will never use to accomdate a
      netfilter callback prototype is absolutely in the core of the
      networking stack.  Frankly it does not make sense and it causes a lot
      of confusion as to why arguments that are never used are being passed
      to the function.
      
      As I am preparing to make a second change to arguments to the okfn even
      the names stops making sense.
      
      As I have removed the two callers of this function remove this confusion
      from the networking stack.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b4aa3ce
  11. 31 8月, 2015 1 次提交
  12. 28 8月, 2015 3 次提交
  13. 19 8月, 2015 1 次提交
    • P
      net: warn if drivers set tx_queue_len = 0 · 906470c1
      Phil Sutter 提交于
      Due to the introduction of IFF_NO_QUEUE, there is a better way for
      drivers to indicate that no qdisc should be attached by default. Though,
      the old convention can't be dropped since ignoring that setting would
      break drivers still using it. Instead, add a warning so out-of-tree
      driver maintainers get a chance to adjust their code before we finally
      get rid of any special handling of tx_queue_len == 0.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      906470c1
  14. 27 7月, 2015 1 次提交
  15. 22 7月, 2015 1 次提交
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf 提交于
      Introduces a new dst_metadata which enables to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38a9eb1
  16. 21 7月, 2015 1 次提交
  17. 16 7月, 2015 1 次提交
  18. 11 7月, 2015 2 次提交
    • J
      net: call rcu_read_lock early in process_backlog · 2c17d27c
      Julian Anastasov 提交于
      Incoming packet should be either in backlog queue or
      in RCU read-side section. Otherwise, the final sequence of
      flush_backlog() and synchronize_net() may miss packets
      that can run without device reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c17d27c
    • J
      net: do not process device backlog during unregistration · e9e4dd32
      Julian Anastasov 提交于
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where processed packet comes from device
      with destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). Above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: backlog can keep
      packets for long time and they do not hold reference to
      device. Such packets are then delivered to upper levels
      at the same time when device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts all packets from backlog but before that some packets
      continue to be delivered to upper levels long after the
      synchronize_net call which is supposed to wait the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use netif_running check to make sure we do not add more
      packets to backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9e4dd32
  19. 09 7月, 2015 3 次提交
  20. 09 6月, 2015 1 次提交
  21. 02 6月, 2015 1 次提交
  22. 22 5月, 2015 1 次提交
  23. 15 5月, 2015 1 次提交
    • F
      net: core: set qdisc pkt len before tc_classify · 3365495c
      Florian Westphal 提交于
      commit d2788d34 ("net: sched: further simplify handle_ing")
      removed the call to qdisc_enqueue_root().
      
      However, after this removal we no longer set qdisc pkt length.
      This breaks traffic policing on ingress.
      
      This is the minimum fix: set qdisc pkt length before tc_classify.
      
      Only setting the length does remove support for 'stab' on ingress, but
      as Alexei pointed out:
       "Though it was allowed to add qdisc_size_table to ingress, it's useless.
        Nothing takes advantage of recomputed qdisc_pkt_len".
      
      Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
      would result in qdisc_pkt_len_init to no longer get inlined due to the
      additional 2nd call site.
      
      ingress policing is rare and GRO doesn't really work that well with police
      on ingress, as we see packets > mtu and drop skbs that  -- without
      aggregation -- would still have fitted the policier budget.
      Thus to have reliable/smooth ingress policing GRO has to be turned off.
      
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Fixes: d2788d34 ("net: sched: further simplify handle_ing")
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3365495c
  24. 14 5月, 2015 4 次提交
    • P
      netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira 提交于
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, that seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accomodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e687ad60
    • P
      net: add CONFIG_NET_INGRESS to enable ingress filtering · 1cf51900
      Pablo Neira 提交于
      This new config switch enables the ingress filtering infrastructure that is
      controlled through the ingress_needed static key. This prepares the
      introduction of the Netfilter ingress hook that resides under this unique
      static key.
      
      Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
      problem since this also depends on CONFIG_NET_CLS_ACT.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cf51900
    • J
      net: move netdev_pick_tx and dependencies to net/core/dev.c · 638b2a69
      Jiri Pirko 提交于
      next to its user. No relation to flow_dissector so it makes no sense to
      have it in flow_dissector.c
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      638b2a69
    • J
      net: move __skb_tx_hash to dev.c · 5605c762
      Jiri Pirko 提交于
      __skb_tx_hash function has no relation to flow_dissect so just move it
      to dev.c
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5605c762
  25. 13 5月, 2015 1 次提交
    • D
      net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue() · a2029240
      Denys Vlasenko 提交于
      These functions compile to 60 bytes of machine code each.
      With this .config: http://busybox.net/~vda/kernel_config
      there are 617 calls of netif_tx_stop_queue()
      and 49 calls of netif_tx_stop_all_queues() in vmlinux.
      
      To fix this, remove WARN_ON in netif_tx_stop_queue()
      as suggested by davem, and deinline netif_tx_stop_all_queues().
      
      Change in code size is about 20k:
      
         text      data      bss       dec     hex filename
      82426986 22255416 20627456 125309858 77813a2 vmlinux.before
      82406248 22255416 20627456 125289120 777c2a0 vmlinux
      
      gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue
      sometimes:
      
      $ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l
      190
      
      ffffffff81b558a8 <netif_tx_stop_queue>:
      ffffffff81b558a8:       55                      push   %rbp
      ffffffff81b558a9:       48 89 e5                mov    %rsp,%rbp
      ffffffff81b558ac:       f0 80 8f e0 01 00 00    lock orb $0x1,0x1e0(%rdi)
      ffffffff81b558b3:       01
      ffffffff81b558b4:       5d                      pop    %rbp
      ffffffff81b558b5:       c3                      retq
      
      This needs additional fixing.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Joe Perches <joe@perches.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2029240
  26. 11 5月, 2015 2 次提交
    • D
      net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann 提交于
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. After having commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers however, are not necessary for calling
      into ingress qdisc. pktgen calling locally into netif_receive_skb()
      with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
      E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!) and execute on that from
      handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
      ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
      is also not applicable, ingress_cl_list provides similar behaviour.
      In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2788d34
    • D
      net: sched: consolidate handle_ing and ing_filter · c9e99fd0
      Daniel Borkmann 提交于
      Given quite some code has been removed from ing_filter(), we can just
      consolidate that function into handle_ing() and get rid of a few
      instructions at the same time.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9e99fd0
  27. 05 5月, 2015 1 次提交
    • V
      net: core: Correct an over-stringent device loop detection. · d66bf7dd
      Vlad Yasevich 提交于
      The code in __netdev_upper_dev_link() has an over-stringent
      loop detection logic that actually prevents valid configurations
      from working correctly.
      
      In particular, the logic returns an error if an upper device
      is already in the list of all upper devices for a given dev.
      This particular check seems to be a overzealous as it disallows
      perfectly valid configurations.  For example:
        # ip l a link eth0 name eth0.10 type vlan id 10
        # ip l a dev br0 typ bridge
        # ip l s eth0.10 master br0
        # ip l s eth0 master br0  <--- Will fail
      
      If you switch the last two commands (add eth0 first), then both
      will succeed.  If after that, you remove eth0 and try to re-add
      it, it will fail!
      
      It appears to be enough to simply check adj_list to keeps things
      safe.
      
      I've tried stacking multiple devices multiple times in all different
      combinations, and either rx_handler registration prevented the stacking
      of the device linking cought the error.
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Acked-by: NVeaceslav Falico <vfalico@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d66bf7dd