1. 20 Feb 2016 (1 commit)
• net: make netdev_for_each_lower_dev safe for device removal · cfdd28be
Submitted by Nikolay Aleksandrov
      When I used netdev_for_each_lower_dev in commit bad53162 ("vrf:
      remove slave queue and private slave struct") I thought that it acts
      like netdev_for_each_lower_private and can be used to remove the current
      device from the list while walking, but unfortunately it acts more like
netdev_for_each_lower_private_rcu and doesn't allow it. The difference
is where "iter" points: right now it points to the current element,
which makes it impossible to remove it. Change the logic to be
similar to netdev_for_each_lower_private and make it point to the "next"
element so we can safely delete the current one. VRF is the only such
user right now; there's no change for the read-only users.
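
A minimal user-space sketch of the iteration pattern the fix moves to
(illustrative names, not the kernel code itself): the helper hands back the
current node but advances "iter" to the next one first, so the caller may
free the node it just got, which is the property netdev_for_each_lower_dev
gains here.

#include <stdio.h>
#include <stdlib.h>

struct node {
	struct node *next;
	int val;
};

/* Return the node at *iter and move *iter one element ahead. */
static struct node *lower_get_next(struct node **iter)
{
	struct node *cur = *iter;

	if (cur)
		*iter = cur->next;
	return cur;
}

#define for_each_lower(head, cur, iter)					\
	for ((iter) = (head), (cur) = lower_get_next(&(iter));		\
	     (cur);							\
	     (cur) = lower_get_next(&(iter)))

int main(void)
{
	struct node *head = NULL, *cur, *iter;
	int i;

	for (i = 0; i < 3; i++) {		/* build 2 -> 1 -> 0 */
		struct node *n = malloc(sizeof(*n));
		n->val = i;
		n->next = head;
		head = n;
	}

	for_each_lower(head, cur, iter) {
		printf("removing %d\n", cur->val);
		free(cur);	/* safe: iter already points past cur */
	}
	return 0;
}

The unfixed variant leaves the iterator pointing at the element just
returned, so freeing it poisons the next dereference, which is what the
oops below shows.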
      
      Here's what can happen now:
      [98423.249858] general protection fault: 0000 [#1] SMP
      [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
      sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
      gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
      parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
      9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
      virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
      ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
      virtio_ring virtio scsi_mod [last unloaded: bridge]
      [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
      4.5.0-rc2+ #81
      [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.8.1-20150318_183358- 04/01/2014
      [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
      ffff88003428c000
      [98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
      netdev_lower_get_next+0x1e/0x30
      [98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
      [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
      0000000000000000
      [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
      ffff880054ff90c0
      [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
      0000000000000000
      [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
      ffff88003428f9e0
      [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
      0000000000000001
      [98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
      knlGS:0000000000000000
      [98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
      00000000000406e0
      [98423.258902] Stack:
      [98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
      ffff880054ff9000
      [98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
      ffff88003428f978
      [98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
      ffff880035b35f00
      [98423.260739] Call Trace:
      [98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
      [98423.261156]  [<ffffffff81518205>]
      rollback_registered_many+0x205/0x390
      [98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
      [98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
      [98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
      [98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
      [98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
      [98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
      [98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
      [98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
      [98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
      [98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
      [98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
      [98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
      [98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
      [98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
      [98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
      [98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
      [98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
      [98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
      [98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
      [98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
      [98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
      90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
      89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
      [98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
      [98423.279920]  RSP <ffff88003428f940>
      
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      Fixes: bad53162 ("vrf: remove slave queue and private slave struct")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      cfdd28be
2. 25 Jan 2016 (1 commit)
• net: simplify napi_synchronize() to avoid warnings · facc432f
Submitted by Arnd Bergmann
      The napi_synchronize() function is defined twice: The definition
      for SMP builds waits for other CPUs to be done, while the uniprocessor
      variant just contains a barrier and ignores its argument.
      
      In the mvneta driver, this leads to a warning about an unused variable
when we look up the NAPI struct of another CPU and then don't use it:
      
      ethernet/marvell/mvneta.c: In function 'mvneta_percpu_notifier':
      ethernet/marvell/mvneta.c:2910:30: error: unused variable 'other_port' [-Werror=unused-variable]
      
      There are no other CPUs on a UP build, so that code never runs, but
      gcc does not know this.
      
      The nicest solution seems to be to turn the napi_synchronize() helper
      into an inline function for the UP case as well, as that leads gcc to
      not complain about the argument being unused. Once we do that, we can
      also combine the two cases into a single function definition and use
      if(IS_ENABLED()) rather than #ifdef to make it look a bit nicer.
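
A rough user-space illustration of why the inline-function form silences
the warning (made-up names, not the mvneta or napi code): passing the
variable to a real function counts as a use even when the SMP-only branch
is compiled out, whereas a macro that discards its argument leaves gcc
seeing an unused variable.

#include <stdio.h>

#define MY_CONFIG_SMP 0			/* pretend this is a UP build */

struct fake_napi { int state; };

/* One combined definition: on UP the branch folds away at compile time,
 * but the argument "n" is still evaluated, so -Wunused-variable stays
 * quiet.  Had this been "#define fake_napi_synchronize(n) do { } while (0)",
 * gcc would warn that "other" below is an unused variable.
 */
static inline void fake_napi_synchronize(const struct fake_napi *n)
{
	if (MY_CONFIG_SMP) {		/* kernel code uses IS_ENABLED(CONFIG_SMP) */
		while (n->state)
			;		/* stand-in for msleep(1) */
	}
	/* else: uniprocessor, nothing to wait for */
}

int main(void)
{
	struct fake_napi other = { 0 };	/* plays the role of "other_port" */

	fake_napi_synchronize(&other);
	printf("synchronized\n");
	return 0;
}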
      
      The warning first came up in linux-4.4, but I failed to catch it
      earlier.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f8642885 ("net: mvneta: Statically assign queues to CPUs")
Signed-off-by: David S. Miller <davem@davemloft.net>
      facc432f
3. 12 Jan 2016 (1 commit)
4. 11 Jan 2016 (1 commit)
• net, sched: add clsact qdisc · 1f211a1b
Submitted by Daniel Borkmann
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to the prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows for a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
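
As a sketch of that shared-state idea (a hypothetical program; the map
layout mirrors iproute2's struct bpf_elf_map and all names and values are
assumptions, not part of this patch), a single cls_bpf object can carry an
"ingress" and an "egress" section that reference one eBPF map and rewrite
skb->priority on the egress side:

/* bar.c -- hypothetical cls_bpf object; assumed build:
 *   clang -O2 -target bpf -c bar.c -o bar.o
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#define __section(NAME) __attribute__((section(NAME), used))

/* iproute2-style ELF map definition (layout assumed from bpf_elf.h) */
struct bpf_elf_map {
	__u32 type;
	__u32 size_key;
	__u32 size_value;
	__u32 max_elem;
	__u32 flags;
	__u32 id;
	__u32 pinning;
};

struct bpf_elf_map shared_state __section("maps") = {
	.type       = BPF_MAP_TYPE_ARRAY,
	.size_key   = sizeof(__u32),
	.size_value = sizeof(__u32),
	.max_elem   = 1,
};

/* BPF helper stubs, resolved by the kernel at load time */
static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
	(void *) BPF_FUNC_map_lookup_elem;
static int (*bpf_map_update_elem)(void *map, const void *key,
				  const void *value, __u64 flags) =
	(void *) BPF_FUNC_map_update_elem;

__section("ingress")
int cls_ingress(struct __sk_buff *skb)
{
	__u32 key = 0, mark = 1;

	/* remember on ingress that this device saw traffic */
	bpf_map_update_elem(&shared_state, &key, &mark, BPF_ANY);
	return TC_ACT_OK;
}

__section("egress")
int cls_egress(struct __sk_buff *skb)
{
	__u32 key = 0, *seen;

	/* consult the state written on ingress and steer the skb into a
	 * pfifo_fast band by rewriting skb->priority */
	seen = bpf_map_lookup_elem(&shared_state, &key);
	if (seen && *seen)
		skb->priority = 6;	/* TC_PRIO_INTERACTIVE */
	return TC_ACT_OK;
}

char license[] __section("license") = "GPL";

The two sections would then be attached with the 'tc filter add ... bpf da
obj bar.o sec ingress/egress' commands shown in the example further below.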
      
Currently, default setups like mq + pfifo_fast would require using, for
example, the prio qdisc instead (to get a tc_classify() run) and
duplicating the egress classifier for each queue. With clsact, the setup
can be left as is; it can additionally assign skb->priority to put the
skb in one of pfifo_fast's bands, and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
the lwt case, we can also use this facility to set up dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, the latter stored close to
dev->_tx to avoid an extra cacheline miss for moderate loads). The egress
part is modelled a bit similarly to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(), code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
      
Full compatibility with the ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means either a user does 'tc qdisc add dev foo ingress'
and configures the ingress qdisc as usual, or uses the 'tc qdisc add dev foo clsact'
alternative, where both ingress and egress classifiers can be configured
as in the example below. The ingress qdisc supports attaching a classifier to
any minor number, whereas clsact has two fixed minors for muxing between the
lists; therefore, to not break user space setups, they are better done as
two separate qdiscs.
      
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused; the module is aliased with
sch_clsact so that it can be auto-loaded properly. The alternative would have
been to add a flag when initializing ingress to alter its behaviour, plus
aliasing to a different name (as it's more than just ingress). However, the
first would end up choosing the new/old behaviour based on the flag by calling
different function implementations anyway, and the latter would require
registering the ingress qdisc once again under a different alias. So this
really calls for a minimal, cleaner approach: separate Qdisc_ops and
Qdisc_class_ops of their own that share the callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
      
Adding filters (deleting etc. works analogously by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
  [1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      1f211a1b
5. 07 Jan 2016 (1 commit)
6. 16 Dec 2015 (4 commits)
7. 07 Dec 2015 (1 commit)
8. 06 Dec 2015 (1 commit)
9. 04 Dec 2015 (9 commits)
10. 03 Dec 2015 (1 commit)
11. 30 Nov 2015 (1 commit)
12. 19 Nov 2015 (4 commits)
13. 18 Nov 2015 (1 commit)
• vlan: Do not put vlan headers back on bridge and macvlan ports · 28f9ee22
Submitted by Vlad Yasevich
      When a vlan is configured with REORDER_HEADER set to 0, the vlan
      header is put back into the packet and makes it appear that
      the vlan header is still there even after it's been processed.
      This posses a problem for bridge and macvlan ports.  The packets
      passed to those device may be forwarded and at the time of the
      forward, vlan headers end up being unexpectedly present.
      
      With the patch, we make sure that we do not put the vlan header
      back (when REORDER_HEADER is 0) if a bridge or macvlan has
      been configured on top of the vlan device.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      28f9ee22
14. 11 Nov 2015 (1 commit)
15. 05 Nov 2015 (1 commit)
16. 25 Oct 2015 (1 commit)
17. 23 Oct 2015 (2 commits)
18. 16 Oct 2015 (1 commit)
19. 07 Oct 2015 (1 commit)
20. 30 Sep 2015 (4 commits)
21. 24 Sep 2015 (1 commit)
• netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
Submitted by Neil Horman
      Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, it's possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
If the adapter gets a tx timeout without a napi instance scheduled, it's possible
      for the adapter to think it has exclusive access to the hardware  (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
Additionally, there is another, more critical race between netpoll and
napi_disable.  The disabled napi state is actually identical to the scheduled
state for a given napi instance.  The implication is that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was doing something
requiring exclusive access.  In the case above, it's fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
The fix should be pretty easy.  netpoll uses its own bit to indicate that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit.  That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion.
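
A compact user-space sketch of that mutual exclusion (mock state bits and
function names, not the kernel implementation): both the disable path and
the netpoll path must win a test-and-set on the NPSVC-style bit before
touching the device, so the interleaving above can no longer run the two
in parallel.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define STATE_SCHED 0x1
#define STATE_NPSVC 0x2

static _Atomic unsigned int napi_state;

/* netpoll side: only poll if we can claim the NPSVC bit */
static bool poll_one_napi(void)
{
	if (!(atomic_load(&napi_state) & STATE_SCHED))
		return false;			/* nothing scheduled */
	if (atomic_fetch_or(&napi_state, STATE_NPSVC) & STATE_NPSVC)
		return false;			/* someone else owns NPSVC: back off */

	/* ... would call the driver's ->poll() here ... */
	atomic_fetch_and(&napi_state, ~STATE_NPSVC);
	return true;
}

/* driver side: claim SCHED *and* NPSVC before resetting the adapter */
static void napi_disable(void)
{
	while (atomic_fetch_or(&napi_state, STATE_SCHED) & STATE_SCHED)
		;				/* kernel would msleep(1) here */
	while (atomic_fetch_or(&napi_state, STATE_NPSVC) & STATE_NPSVC)
		;				/* wait for netpoll to finish */
}

int main(void)
{
	napi_disable();
	printf("polled after disable? %s\n", poll_one_napi() ? "yes" : "no");
	return 0;
}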
      
      Change notes:
      V2)
	Remove a trailing whitespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
      2d8bff12
22. 18 Sep 2015 (1 commit)
• netfilter: Pass net into okfn · 0c4b51f0
Submitted by Eric W. Biederman
      This is immediately motivated by the bridge code that chains functions that
call into netfilter.  Without passing net into the okfns, the bridge code would
need to guess the best expression for the network namespace in which to process
packets.
      
As net is frequently one of the first things computed in continuation functions
after netfilter has done its job, passing in the desired network namespace is in
many cases a code simplification.
      
To support this change, the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
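
A self-contained sketch of that adapter using mock types (reconstructed
from the description above, not copied from the patch): the okfn signature
grows a struct net argument, and dst_output_okfn drops it while forwarding
to the old two-argument dst_output().

#include <stdio.h>

struct net { int id; };
struct sock { int dummy; };
struct sk_buff { int len; };

/* old-style output function, unaware of struct net */
static int dst_output(struct sock *sk, struct sk_buff *skb)
{
	(void)sk;
	printf("output skb len=%d\n", skb->len);
	return 0;
}

/* new okfn signature: (net, sk, skb); net is silently dropped for now */
static int dst_output_okfn(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	(void)net;
	return dst_output(sk, skb);
}

int main(void)
{
	struct net n = { 1 };
	struct sock sk = { 0 };
	struct sk_buff skb = { 64 };

	/* where NF_HOOK() would invoke the okfn after netfilter processing */
	return dst_output_okfn(&n, &sk, &skb);
}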
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4b51f0