1. 04 Dec, 2015 1 commit
  2. 21 Nov, 2015 1 commit
  3. 19 Nov, 2015 9 commits
    •
      net: provide generic busy polling to all NAPI drivers · 93d05d4a
      Eric Dumazet authored
      NAPI drivers no longer need to observe a particular protocol
      to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
      
      napi_hash_add() and napi_hash_del() are now automatically called
      from the core networking stack, from netif_napi_add() and
      netif_napi_del() respectively
      
      This patch depends on free_netdev() and netif_napi_del() being
      called from process context, which seems to be the norm.
      
      Drivers might still prefer to call napi_hash_del() on their
      own, since they might combine all the RCU grace periods into
      a single one, knowing their NAPI structures' lifetimes, while the
      core networking stack has no idea of a possible combining.
      
      Once this patch proves not to bring serious regressions,
      we will clean up drivers to either remove napi_hash_del()
      or provide appropriate RCU grace period combining.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93d05d4a
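      
      A minimal sketch of what this means for a driver (hypothetical driver
      "foo"; foo_priv and foo_poll are illustrative names, not from the
      patch): the usual NAPI registration is now all that is needed, since
      the core calls napi_hash_add()/napi_hash_del() internally.
      
      /* Hypothetical driver: nothing busy-poll specific left to do. */
      static int foo_poll(struct napi_struct *napi, int budget)
      {
              int work_done = 0;
              /* ... clean RX ring, hand skbs to napi_gro_receive() ... */
              if (work_done < budget)
                      napi_complete(napi);
              return work_done;
      }
      
      static void foo_setup(struct net_device *dev, struct foo_priv *priv)
      {
              /* netif_napi_add() now also hashes the napi, allocating a
               * napi_id, so this instance becomes busy-pollable. */
              netif_napi_add(dev, &priv->napi, foo_poll, 64);
      }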
    •
      net: napi_hash_del() returns a boolean status · 34cbe27e
      Eric Dumazet authored
      napi_hash_del() will soon be used by both drivers (if they want)
      and the core networking stack.
      
      Callers are responsible for ensuring an RCU grace period is respected
      before freeing the napi structure: napi_hash_del() can signal whether
      this RCU grace period is needed or not.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34cbe27e
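      
      A sketch of how a driver teardown path might consume the new return
      value (hypothetical foo_teardown; synchronize_net() stands in for the
      required RCU grace period):
      
      static void foo_teardown(struct foo_priv *priv)
      {
              /* napi_hash_del() returns true if the napi was hashed, i.e.
               * an RCU grace period must elapse before freeing it. */
              if (napi_hash_del(&priv->napi))
                      synchronize_net();
              netif_napi_del(&priv->napi);
              kfree(priv);
      }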
    •
      net: move napi_hash[] into read mostly section · 6180d9de
      Eric Dumazet authored
      We do not often add or delete a napi context.
      Moving napi_hash[] into the read-mostly section avoids potential false sharing.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6180d9de
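      
      The shape of the change, roughly (a sketch, not the exact diff): the
      table is written only on napi add/delete but read on every busy-poll
      lookup, so it belongs with other rarely written data.
      
      /* net/core/dev.c, sketched: */
      static struct hlist_head napi_hash[1 << 8] __read_mostly;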
    •
      net: add netif_tx_napi_add() · d64b5e85
      Eric Dumazet authored
      netif_tx_napi_add() is a variant of netif_napi_add()
      
      It should be used by drivers that use a napi structure
      to exclusively poll TX.
      
      We do not want to add this kind of napi to napi_hash[] in the
      following patches, which add generic busy polling to all NAPI drivers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d64b5e85
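      
      A sketch of the intended usage (hypothetical driver; foo_tx_poll only
      reaps TX completions):
      
      static int foo_tx_poll(struct napi_struct *napi, int budget)
      {
              /* ... free completed TX descriptors, no RX work ... */
              napi_complete(napi);
              return 0;
      }
      
      static void foo_setup_tx(struct net_device *dev, struct foo_priv *priv)
      {
              /* Registers the napi without hashing it, so busy polling
               * never selects this TX-only instance. */
              netif_tx_napi_add(dev, &priv->tx_napi, foo_tx_poll, 64);
      }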
    •
      net: move skb_mark_napi_id() into core networking stack · 93f93a44
      Eric Dumazet authored
      We would like to automatically provide busy polling support
      to all NAPI drivers, without them having to implement anything.
      
      skb_mark_napi_id() can be called from napi_gro_receive() and
      napi_get_frags().
      
      A few drivers are still calling skb_mark_napi_id() because
      they use netif_receive_skb(). They should eventually call
      napi_gro_receive() instead; I will leave this to the driver
      maintainers. (See the interim pattern sketched below.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93f93a44
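      
      For those remaining drivers, the interim pattern looks roughly like
      this (hypothetical driver; foo_rx is an illustrative name):
      
      static void foo_rx(struct foo_priv *priv, struct sk_buff *skb)
      {
              /* Still needed while netif_receive_skb() is used directly: */
              skb_mark_napi_id(skb, &priv->napi);
              netif_receive_skb(skb);
      
              /* Preferred conversion; the core then marks the skb itself:
               *      napi_gro_receive(&priv->napi, skb);
               */
      }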
    •
      net: network drivers no longer need to implement ndo_busy_poll() · ce6aea93
      Eric Dumazet authored
      Instead of having to implement the complex ndo_busy_poll() method,
      drivers can simply rely on the NAPI poll logic.
      
      Busy polling gains come mainly from the polling itself,
      not from the exact details of how we poll the device.
      
      ndo_busy_poll(), if implemented, can avoid touching
      napi state, but it adds extra synchronization between
      the normal napi->poll() and the busy poll handler, slowing down
      the common path (non busy polling) with extra atomic operations.
      In practice, few drivers ever implemented busy poll because of the complexity.
      
      We could go one step further, and make busy polling
      available for all NAPI drivers, but this would require
      that all netif_napi_del() calls are done in process context
      so that we can call synchronize_rcu().
      A full audit would be required.
      
      Before this is done, a driver still needs to do the following
      (sketched after this entry):
      
      - skb_mark_napi_id() for each skb provided to the stack.
      - napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
      - Make sure RCU grace period is respected after napi_hash_del() before
        memory containing napi structure is freed.
      
      A followup patch implements busy poll for the mlx5 driver as an example.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce6aea93
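      
      A sketch of the interim obligations listed above (hypothetical driver
      "foo"; function names are illustrative):
      
      static void foo_open(struct foo_priv *priv)
      {
              netif_napi_add(priv->dev, &priv->napi, foo_poll, 64);
              napi_hash_add(&priv->napi);         /* allocate a napi_id */
      }
      
      static void foo_rx_skb(struct foo_priv *priv, struct sk_buff *skb)
      {
              skb_mark_napi_id(skb, &priv->napi); /* tag every skb */
              napi_gro_receive(&priv->napi, skb);
      }
      
      static void foo_close(struct foo_priv *priv)
      {
              napi_hash_del(&priv->napi);
              synchronize_rcu();                  /* grace period before free */
              netif_napi_del(&priv->napi);
      }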
    •
      net: allow BH servicing in sk_busy_loop() · 2a028ecb
      Eric Dumazet authored
      Instead of blocking BHs for the whole of sk_busy_loop(), block them
      only around the ->ndo_busy_poll() calls.
      
      This has many benefits.
      
      1) allow tunneled traffic to use busy poll as well as native traffic.
         Tunnel handlers usually call netif_rx() and depend on net_rx_action()
         being run (from the softirq handler)
      
      2) allow RFS/RPS to be used (sending IPIs to other cpus if needed)
      
      3) use the 'let's burn cpu cycles' budget to do useful work
         (like TX completions, timers, RCU callbacks...)
      
      4) reduce BH latencies, making busy poll a better citizen.
      
      Tested:
      
      Tested with SIT tunnel
      
      lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    37373.93
      16384  87380
      
      Now enable busy poll on both hosts
      
      lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa6:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    58314.77
      16384  87380
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a028ecb
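      
      A simplified sketch of the control-flow change inside sk_busy_loop()
      (not the exact kernel code; variable names approximate):
      
      do {
              local_bh_disable();
              rc = ops->ndo_busy_poll(napi);
              local_bh_enable();  /* pending softirqs (net_rx_action, timers,
                                   * TX completions, RCU callbacks) run here */
              cpu_relax();
      } while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
               !need_resched() && !busy_loop_timeout(end_time));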
    •
      net: un-inline sk_busy_loop() · 02d62e86
      Eric Dumazet authored
      There is really little gain from inlining this big function,
      and we will soon make it even bigger in the following patches.
      
      This also means we no longer need to export napi_by_id().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02d62e86
    •
      net: better skb->sender_cpu and skb->napi_id cohabitation · 52bd2d62
      Eric Dumazet authored
      skb->sender_cpu and skb->napi_id share common storage,
      and we have had various bugs because of this.
      
      We had to call skb_sender_cpu_clear() in some places so as not to
      leave a prior skb->napi_id in place and fool netdev_pick_tx().
      
      As suggested by Alexei, we can split the space so that
      these errors cannot happen.
      
      With the 0 value reserved as the common (not initialized) value,
      let's reserve the [1 .. NR_CPUS] range for valid sender_cpu values,
      and [NR_CPUS+1 .. ~0U] for valid napi_id values.
      
      This will allow proper busy polling support over tunnels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52bd2d62
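      
      A sketch of the resulting convention (the range check shown is
      illustrative, not the exact kernel code):
      
      /* skb->napi_id and skb->sender_cpu share one 32-bit field:
       *     0                   not initialized
       *     1 .. NR_CPUS        valid sender_cpu
       *     NR_CPUS+1 .. ~0U    valid napi_id
       * so each consumer can reject a foreign value: */
      static inline void sk_mark_napi_id(struct sock *sk, struct sk_buff *skb)
      {
              if (skb->napi_id > NR_CPUS)   /* ignore sender_cpu leftovers */
                      sk->sk_napi_id = skb->napi_id;
      }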
  4. 18 Nov, 2015 1 commit
  5. 17 Nov, 2015 3 commits
  6. 05 Nov, 2015 1 commit
    •
      net/core: ensure features get disabled on new lower devs · e7868a85
      Jarod Wilson authored
      When moving netdev_sync_lower_features() after the .ndo_set_features
      calls, I neglected to verify that devices added *after* a flag had been
      disabled on an upper device were properly added with that flag disabled
      as well. This currently happens because we exit __netdev_update_features()
      when we see dev->features == features for the upper dev. We can retain
      the optimization of leaving without calling .ndo_set_features with a bit
      of tweaking and a goto here (sketched after this entry).
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7868a85
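      
      A simplified sketch of the goto described above (approximate; the real
      __netdev_update_features() carries more logic):
      
      int __netdev_update_features(struct net_device *dev)
      {
              struct net_device *lower;
              struct list_head *iter;
              netdev_features_t features;
      
              /* ... compute the new "features" for this dev ... */
      
              if (dev->features == features)
                      goto sync_lower;    /* was: return 0; */
      
              /* ... dev->netdev_ops->ndo_set_features(dev, features) ... */
      
      sync_lower:
              /* Even when the upper dev is unchanged, newly added lower
               * devs may still need features disabled. */
              netdev_for_each_lower_dev(dev, lower, iter)
                      netdev_sync_lower_features(dev, lower, features);
              return 0;
      }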
  7. 04 Nov, 2015 1 commit
    •
      net/core: fix for_each_netdev_feature · 5ba3f7d6
      Jarod Wilson authored
      As pointed out by Nikolay and further explained by Geert, the initial
      for_each_netdev_feature macro was broken: "feature" would get set outside
      of the block of code it was intended to run in, so the macro only ever
      worked for the first feature bit in the mask. While less pretty this way,
      this version is tested and confirmed functional with multiple feature
      bits set in NETIF_F_UPPER_DISABLES (see the sketch after this entry).
      
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      ...
      [  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      [root@dell-per730-01 ~]# ethtool -K bond0 gro off
      ...
      [  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
      [  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
      [  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Geert Uytterhoeven <geert@linux-m68k.org>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ba3f7d6
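      
      The essence of the fix, sketched (approximate shape of the macro and
      its use; upper_disables is an illustrative caller variable):
      
      /* Broken: the feature assignment sat outside the loop body, so only
       * the first set bit was ever processed. Fixed shape: iterate over
       * bit positions and derive the feature inside the block using it. */
      #define for_each_netdev_feature(mask_addr, bit) \
              for_each_set_bit(bit, (unsigned long *)mask_addr, \
                               NETDEV_FEATURE_COUNT)
      
      netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
      netdev_features_t feature;
      int bit;
      
      for_each_netdev_feature(&upper_disables, bit) {
              feature = __NETIF_F_BIT(bit);
              /* ... disable "feature" on the lower dev ... */
      }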
  8. 03 Nov, 2015 1 commit
    •
      net/core: generic support for disabling netdev features down stack · fd867d51
      Jarod Wilson authored
      There are some netdev features which, when disabled on an upper device
      such as a bonding master or a bridge, must be disabled and cannot be
      re-enabled on underlying devices.
      
      This is a rework of an earlier, more heavy-handed approach. It simply
      disables, and prevents re-enabling of, the netdev features listed in a
      new define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES.
      When an upper device disables a flag in that feature mask, the disabling
      propagates down the stack, and any lower device that has any upper device
      with one of those flags disabled cannot enable said flag.
      
      Initially, only LRO is included for proof of concept, and because this
      code effectively does the same thing as dev_disable_lro(), though it will
      also activate from the ethtool path, which was one of the goals here.
      
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: off
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: off
      
      dmesg dump:
      
      [ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      This has been successfully tested with bnx2x, qlcnic and netxen network
      cards as slaves in a bond interface. Turning LRO on or off on the master
      also turns it on or off on each of the slaves, new slaves are added with
      LRO in the same state as the master, and LRO can't be toggled on the
      slaves.
      
      Also, this should largely remove the need for dev_disable_lro(), and most,
      if not all, of its call sites can be replaced by simply making sure
      NETIF_F_LRO isn't included in the relevant device's feature flags.
      
      Note that this patch is driven by bug reports from users saying it was
      confusing that bonds and slaves had different settings for the same
      features, and while it won't be 100% in sync if a lower device doesn't
      support a feature like LRO, I think this is a good step in the right
      direction.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd867d51
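      
      A simplified sketch of the mechanism (condensed from the patch
      description; the real code handles more bookkeeping):
      
      /* include/linux/netdev_features.h: features that, once disabled on
       * an upper device, must stay disabled on all lower devices. */
      #define NETIF_F_UPPER_DISABLES  NETIF_F_LRO   /* LRO only, for now */
      
      /* Called from the feature-update path for each lower dev: */
      static void netdev_sync_lower_features(struct net_device *upper,
                                             struct net_device *lower,
                                             netdev_features_t features)
      {
              netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
              netdev_features_t feature;
              int bit;
      
              for_each_netdev_feature(&upper_disables, bit) {
                      feature = __NETIF_F_BIT(bit);
                      if (!(features & feature) && (lower->features & feature)) {
                              netdev_dbg(upper, "Disabling feature %pNF on lower dev %s.\n",
                                         &feature, lower->name);
                              lower->wanted_features &= ~feature;
                              netdev_update_features(lower);
                      }
              }
      }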
  9. 23 Oct, 2015 1 commit
    •
      openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar authored
      While transitioning to the netdev-based vport we broke the OVS
      feature which allows the user to retrieve tunnel packet egress
      information for lwtunnel devices. The following patch fixes it
      by introducing an ndo operation to get the tunnel egress info.
      The same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices, so after adding such a device operation
      we can remove the similar operation from ovs-vport.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc4099f1
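      
      A sketch of the new device operation's shape (names per my reading of
      the patch; treat the exact signature as approximate):
      
      /* net_device_ops gains a hook that fills skb_dst() with the tunnel
       * metadata the packet would get on egress: */
      int (*ndo_fill_metadata_dst)(struct net_device *dev,
                                   struct sk_buff *skb);
      
      /* OVS side, simplified (illustrative wrapper name): the same call
       * works for lwtunnel devices and compat ovs-tnl-vport devices. */
      static int ovs_egress_tun_info(struct net_device *dev, struct sk_buff *skb)
      {
              if (!dev->netdev_ops->ndo_fill_metadata_dst)
                      return -EINVAL;
              return dev->netdev_ops->ndo_fill_metadata_dst(dev, skb);
      }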
  10. 16 Oct, 2015 1 commit
  11. 05 Oct, 2015 1 commit
  12. 26 Sep, 2015 1 commit
  13. 24 Sep, 2015 1 commit
    •
      netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Neil Horman authored
      Drivers might call napi_disable while not holding the napi instance
      poll_lock. In those instances, it's possible for a race condition to
      exist between poll_one_napi and napi_disable. That is to say,
      poll_one_napi only tests the NAPI_STATE_SCHED bit to see if there is
      work to do during a poll, so the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, it's
      possible for the adapter to think it has exclusive access to the hardware
      (as the napi instance is now scheduled via the napi_disable call), while
      the netpoll code thinks there is simply work to do. The result is
      parallel hardware access, leading to corrupt data structures in the
      driver, and a crash.
      
      Additionally, there is another, more critical race between netpoll and
      napi_disable. The disabled napi state is actually identical to the
      scheduled state for a given napi instance. The implication is that, if a
      napi instance is disabled, a netconsole instance would see the napi state
      of the device as having been scheduled, and poll it, likely while the
      driver was doing something requiring exclusive access. In the case above,
      it's fairly clear that not having the rings in a state ready to be polled
      will cause any number of crashes.
      
      The fix should be pretty easy. netpoll uses its own bit to indicate that
      the napi instance is in a state of being serviced by netpoll
      (NAPI_STATE_NPSVC). We can just gate disabling on that bit as well as the
      sched bit. That should prevent netpoll from conducting a napi poll if we
      convert its set bit to a test_and_set_bit operation to provide mutual
      exclusion (sketched after this entry).
      
      Change notes:
      V2)
      	Remove a trailing whitespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d8bff12
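      
      A simplified sketch of the resulting mutual exclusion (close to the
      description above; not the exact diff):
      
      /* netpoll side: take NPSVC before polling; if it is already set
       * (e.g. napi_disable in progress), skip this poll. */
      static int poll_one_napi(struct napi_struct *napi, int budget)
      {
              if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state))
                      return budget;
              /* ... napi->poll(napi, budget) ... */
              clear_bit(NAPI_STATE_NPSVC, &napi->state);
              return budget;
      }
      
      /* napi_disable side: gate on NPSVC as well as SCHED. */
      void napi_disable(struct napi_struct *n)
      {
              set_bit(NAPI_STATE_DISABLE, &n->state);
              while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
                      msleep(1);
              while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
                      msleep(1);
              clear_bit(NAPI_STATE_DISABLE, &n->state);
      }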
  14. 18 Sep, 2015 4 commits
    •
      bpf: add bpf_redirect() helper · 27b29f63
      Alexei Starovoitov authored
      The existing bpf_clone_redirect() helper clones the skb before
      redirecting it to the RX or TX path of the destination netdev.
      Introduce a bpf_redirect() helper that does the same without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1.
      The 'clone_redirect_xmit' section in the tcbpf1_kern.o file has the
      program that does bpf_clone_redirect(), and performance is 2.0 Mpps.
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison, the Linux bridge in this setup does 2.1 Mpps,
      and ixgbe rx + drop in ip_rcv does 7.8 Mpps.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27b29f63
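      
      A minimal cls_bpf program sketch using the new helper (modeled on the
      tcbpf1_kern.o sections referenced above; the helper header is an
      assumption based on samples/bpf of that era):
      
      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"   /* assumed samples/bpf helper header */
      
      SEC("redirect_xmit")
      int _redirect_xmit(struct __sk_buff *skb)
      {
              /* No clone: redirect the original skb to the egress (TX)
               * path of ifindex + 1; flag bit 0 would select ingress. */
              return bpf_redirect(skb->ifindex + 1, 0);
      }
      
      char _license[] SEC("license") = "GPL";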
    •
      netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman authored
      This is immediately motivated by the bridge code, which chains functions
      that call into netfilter. Without passing net into the okfns, the bridge
      code would need to guess about the best expression for the network
      namespace to process packets in.
      
      As net is frequently one of the first things computed in continuation
      functions after netfilter has done its job, passing in the desired
      network namespace is in many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c4b51f0
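      
      The resulting okfn shape, sketched:
      
      /* okfns now receive the net namespace explicitly: */
      int (*okfn)(struct net *net, struct sock *sk, struct sk_buff *skb);
      
      /* Adapter so dst_output (which does not take net yet) can still be
       * passed as an okfn; it silently drops net for the moment: */
      static inline int dst_output_okfn(struct net *net, struct sock *sk,
                                        struct sk_buff *skb)
      {
              return dst_output(sk, skb);
      }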
    •
      bridge: Add br_netif_receive_skb remove netif_receive_skb_sk · 04eb4489
      Eric W. Biederman authored
      netif_receive_skb_sk is only called once in the bridge code; replace
      it with a bridge-specific function that calls netif_receive_skb.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04eb4489
    •
      net: Remove dev_queue_xmit_sk · 2b4aa3ce
      Eric W. Biederman authored
      A function with weird arguments that it will never use, added to
      accommodate a netfilter callback prototype, sits absolutely in the core
      of the networking stack. Frankly, it does not make sense, and it causes
      a lot of confusion as to why arguments that are never used are being
      passed to the function.
      
      As I am preparing to make a second change to the arguments of the okfn,
      even the name stops making sense.
      
      As I have removed the two callers of this function, remove this
      confusion from the networking stack.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b4aa3ce
  15. 31 Aug, 2015 1 commit
  16. 28 Aug, 2015 3 commits
  17. 19 Aug, 2015 1 commit
    •
      net: warn if drivers set tx_queue_len = 0 · 906470c1
      Phil Sutter authored
      Due to the introduction of IFF_NO_QUEUE, there is a better way for
      drivers to indicate that no qdisc should be attached by default. However,
      the old convention can't simply be dropped, since ignoring that setting
      would break drivers still using it. Instead, add a warning so out-of-tree
      driver maintainers get a chance to adjust their code before we finally
      get rid of any special handling of tx_queue_len == 0.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      906470c1
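      
      A sketch of the warning and of the preferred replacement (simplified;
      the exact message wording is approximate):
      
      /* register_netdevice(), sketched: */
      if (!dev->tx_queue_len)
              pr_warn("%s uses deprecated zero tx_queue_len - convert driver to use IFF_NO_QUEUE instead\n",
                      dev->name);
      
      /* What a driver should do instead to get no default qdisc: */
      dev->priv_flags |= IFF_NO_QUEUE;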
  18. 27 Jul, 2015 1 commit
  19. 22 Jul, 2015 1 commit
    •
      dst: Metadata destinations · f38a9eb1
      Thomas Graf authored
      This introduces a new dst_metadata which enables carrying per-packet
      metadata between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcounted as expected for now, but a
      non-reference-counted use is possible if the reference is forced
      before queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function, thus allowing metadata
      to be interpreted in a later commit.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f38a9eb1
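      
      A sketch of the structure described above (simplified; the union member
      shown is the tunnel case this series targets):
      
      /* include/net/dst_metadata.h, sketched: a dst_entry plus a union of
       * per-packet metadata types, one dst instance per metadata type. */
      struct metadata_dst {
              struct dst_entry                dst;
              union {
                      struct ip_tunnel_info   tun_info;
              } u;
      };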
  20. 21 Jul, 2015 1 commit
  21. 16 Jul, 2015 1 commit
  22. 11 Jul, 2015 2 commits
    •
      net: call rcu_read_lock early in process_backlog · 2c17d27c
      Julian Anastasov authored
      An incoming packet should be either in the backlog queue or in an RCU
      read-side section. Otherwise, the final sequence of flush_backlog()
      and synchronize_net() may miss packets that can run without a device
      reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c17d27c
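      
      A simplified sketch of the reordering inside process_backlog() (not
      the exact diff; sd is the per-cpu softnet_data):
      
      while ((skb = __skb_dequeue(&sd->process_queue))) {
              rcu_read_lock();      /* enter the RCU section while local
                                     * IRQs are still disabled, so the skb
                                     * is never "in between" */
              local_irq_enable();
              __netif_receive_skb(skb);
              rcu_read_unlock();
              local_irq_disable();
              /* ... quota accounting ... */
      }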
    •
      net: do not process device backlog during unregistration · e9e4dd32
      Julian Anastasov authored
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where a processed packet comes from a device
      with a destroyed inetdev (dev->ip_ptr). This is not expected,
      because inetdev_destroy is called in the NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). The above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: the backlog can keep
      packets for a long time, and they do not hold a reference to
      the device. Such packets are then delivered to upper levels
      at the same time as the device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts for all packets from the backlog, but before that some
      packets continue to be delivered to upper levels long after the
      synchronize_net call, which is supposed to wait for the last
      ones. Also, as Eric pointed out, the processing of packets, mostly
      from other devices, can continue to add new packets to the backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use a netif_running check to make sure we do not add more
      packets to the backlog. We have to do it in the enqueue_to_backlog
      context, when the local IRQ is disabled. As a result, after the
      flush_backlog and synchronize_net sequence, all packets
      should be accounted for.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9e4dd32
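      
      A simplified sketch of the enqueue_to_backlog() check described above
      (condensed from the real function):
      
      /* Runs with local IRQs disabled, so the check cannot race with the
       * flush_backlog + synchronize_net sequence on unregister. */
      rps_lock(sd);
      if (!netif_running(skb->dev))
              goto drop;
      if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
              __skb_queue_tail(&sd->input_pkt_queue, skb);
              /* ... schedule NAPI / return NET_RX_SUCCESS ... */
      }
      drop:
      rps_unlock(sd);
      /* ... sd->dropped++; kfree_skb(skb); return NET_RX_DROP; */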
  23. 09 Jul, 2015 2 commits