1. 16 6月, 2016 1 次提交
    • E
      net_sched: add the ability to defer skb freeing · 1b5c5493
      Eric Dumazet 提交于
      qdisc are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of disc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b5c5493
  2. 11 1月, 2016 1 次提交
    • D
      net, sched: add clsact qdisc · 1f211a1b
      Daniel Borkmann 提交于
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
      In both cases, it's execution happens without taking the qdisc lock, and
      the main difference for the egress part compared to prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
      the clsact egress part allows to have a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
      
      Currently, default setups like mq + pfifo_fast would require for this to
      use, for example, prio qdisc instead (to get a tc_classify() run) and to
      duplicate the egress classifier for each queue. With clsact, it allows
      for leaving the setup as is, it can additionally assign skb->priority to
      put the skb in one of pfifo_fast's bands and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
      w/o the need to perform a skb_dst_force() to hold on to it any longer. In
      lwt case, we can also use this facility to setup dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
      (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
      dev->_tx to avoid extra cacheline miss for moderate loads). The egress
      part is a bit similar modelled to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(), code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
      
      Full compatibility to ingress qdisc is provided as well. Since both
      piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
      per netdevice, and thus ingress qdisc specific behaviour can be retained
      for user space. This means, either a user does 'tc qdisc add dev foo ingress'
      and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
      alternative, where both, ingress and egress classifier can be configured
      as in the below example. ingress qdisc supports attaching classifier to any
      minor number whereas clsact has two fixed minors for muxing between the
      lists, therefore to not break user space setups, they are better done as
      two separate qdiscs.
      
      I decided to extend the sch_ingress module with clsact functionality so
      that commonly used code can be reused, the module is being aliased with
      sch_clsact so that it can be auto-loaded properly. Alternative would have been
      to add a flag when initializing ingress to alter its behaviour plus aliasing
      to a different name (as it's more than just ingress). However, the first would
      end up, based on the flag, choosing the new/old behaviour by calling different
      function implementations to handle each anyway, the latter would require to
      register ingress qdisc once again under different alias. So, this really begs
      to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
      by its own that share callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
      
      Adding filters (deleting, etc works analogous by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
        [1] http://patchwork.ozlabs.org/patch/512949/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f211a1b
  3. 09 10月, 2015 1 次提交
  4. 23 6月, 2015 1 次提交
    • S
      switchdev; add VLAN support for port's bridge_getlink · 7d4f8d87
      Scott Feldman 提交于
      One more missing piece of the puzzle.  Add vlan dump support to switchdev
      port's bridge_getlink.  iproute2 "bridge vlan show" cmd already knows how
      to show the vlans installed on the bridge and the device , but (until now)
      no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
      msg.  Before this patch, "bridge vlan show":
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlans
      		 57
      
      	sw1p1				<< device side vlans (missing)
      
      	sw1p2    57
      
      	sw1p2
      
      	sw1p3
      
      	sw1p4
      
      	br0     None
      
      (When the port is bridged, the output repeats the vlan list for the vlans
      on the bridge side of the port and the vlans on the device side of the
      port.  The listing above show no vlans for the device side even though they
      are installed).
      
      After this patch:
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlan
      		 57
      
      	sw1p1    30-34			<< device side vlans
      		 57
      		 3840 PVID
      
      	sw1p2    57
      
      	sw1p2    57
      		 3840 PVID
      
      	sw1p3    3842 PVID
      
      	sw1p4    3843 PVID
      
      	br0     None
      
      I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
      switchdev support adds an obj dump for VLAN objects, using the same
      call-back scheme as FDB dump.  Support included for both compressed and
      un-compressed vlan dumps.
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d4f8d87
  5. 14 5月, 2015 2 次提交
  6. 30 4月, 2015 1 次提交
    • N
      bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel 提交于
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c264da
  7. 14 4月, 2015 1 次提交
    • D
      net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Daniel Borkmann 提交于
      Even if we make use of classifier and actions from the egress
      path, we're going into handle_ing() executing additional code
      on a per-packet cost for ingress qdisc, just to realize that
      nothing is attached on ingress.
      
      Instead, this can just be blinded out as a no-op entirely with
      the use of a static key. On input fast-path, we already make
      use of static keys in various places, e.g. skb time stamping,
      in RPS, etc. It makes sense to not waste time when we're assured
      that no ingress qdisc is attached anywhere.
      
      Enabling/disabling of that code path is being done via two
      helpers, namely net_{inc,dec}_ingress_queue(), that are being
      invoked under RTNL mutex when a ingress qdisc is being either
      initialized or destructed.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4577139b
  8. 10 12月, 2014 1 次提交
    • M
      rtnetlink: delay RTM_DELLINK notification until after ndo_uninit() · 395eea6c
      Mahesh Bandewar 提交于
      The commit 56bfa7ee ("unregister_netdevice : move RTM_DELLINK to
      until after ndo_uninit") tried to do this ealier but while doing so
      it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
      delayed call to fill_info(). So this translated into asking driver
      to remove private state and then query it's private state. This
      could have catastropic consequences.
      
      This change breaks the rtmsg_ifinfo() into two parts - one takes the
      precise snapshot of the device by called fill_info() before calling
      the ndo_uninit() and the second part sends the notification using
      collected snapshot.
      
      It was brought to notice when last link is deleted from an ipvlan device
      when it has free-ed the port and the subsequent .fill_info() call is
      trying to get the info from the port.
      
      kernel: [  255.139429] ------------[ cut here ]------------
      kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
      kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
      kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
      kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
      kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
      kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
      kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
      kernel: [  255.139520] Call Trace:
      kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
      kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
      kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
      kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
      kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
      kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
      kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
      kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
      kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
      kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
      kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
      kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
      kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
      kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
      kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
      kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
      kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
      kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
      kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
      kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
      kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
      kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
      kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
      kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
      kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
      kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
      kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
      kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reported-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      395eea6c
  9. 03 12月, 2014 2 次提交
  10. 14 9月, 2014 2 次提交
  11. 11 7月, 2014 1 次提交
  12. 16 5月, 2014 1 次提交
  13. 18 12月, 2013 1 次提交
  14. 26 10月, 2013 1 次提交
    • A
      net: fix rtnl notification in atomic context · 7f294054
      Alexei Starovoitov 提交于
      commit 991fb3f7 "dev: always advertise rx_flags changes via netlink"
      introduced rtnl notification from __dev_set_promiscuity(),
      which can be called in atomic context.
      
      Steps to reproduce:
      ip tuntap add dev tap1 mode tap
      ifconfig tap1 up
      tcpdump -nei tap1 &
      ip tuntap del dev tap1 mode tap
      
      [  271.627994] device tap1 left promiscuous mode
      [  271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
      [  271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
      [  271.677525] INFO: lockdep is turned off.
      [  271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G        W    3.12.0-rc3+ #73
      [  271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [  271.731254]  ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
      [  271.760261]  ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
      [  271.790683]  0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
      [  271.822332] Call Trace:
      [  271.838234]  [<ffffffff817544e5>] dump_stack+0x55/0x76
      [  271.854446]  [<ffffffff8108bad1>] __might_sleep+0x181/0x240
      [  271.870836]  [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0
      [  271.887076]  [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0
      [  271.903368]  [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0
      [  271.919716]  [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0
      [  271.936088]  [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0
      [  271.952504]  [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0
      [  271.968902]  [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100
      [  271.985302]  [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0
      [  272.001642]  [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0
      [  272.017917]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.033961]  [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50
      [  272.049855]  [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0
      [  272.065494]  [<ffffffff81732052>] packet_notifier+0x1b2/0x380
      [  272.080915]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.096009]  [<ffffffff81761c66>] notifier_call_chain+0x66/0x150
      [  272.110803]  [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10
      [  272.125468]  [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20
      [  272.139984]  [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70
      [  272.154523]  [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20
      [  272.168552]  [<ffffffff816224c5>] rollback_registered_many+0x145/0x240
      [  272.182263]  [<ffffffff81622641>] rollback_registered+0x31/0x40
      [  272.195369]  [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90
      [  272.208230]  [<ffffffff81547ca0>] __tun_detach+0x140/0x340
      [  272.220686]  [<ffffffff81547ed6>] tun_chr_close+0x36/0x60
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f294054
  15. 08 3月, 2013 1 次提交
  16. 01 11月, 2012 1 次提交
    • J
      ixgbe: add setlink, getlink support to ixgbe and ixgbevf · 815cccbf
      John Fastabend 提交于
      This adds support for the net device ops to manage the embedded
      hardware bridge on ixgbe devices. With this patch the bridge
      mode can be toggled between VEB and VEPA to support stacking
      macvlan devices or using the embedded switch without any SW
      component in 802.1Qbg/br environments.
      
      Additionally, this adds source address pruning to the ixgbevf
      driver to prune any frames sent back from a reflective relay on
      the switch. This is required because the existing hardware does
      not support this. Without it frames get pushed into the stack
      with its own src mac which is invalid per 802.1Qbg VEPA
      definition.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      815cccbf
  17. 13 10月, 2012 1 次提交
  18. 11 7月, 2012 1 次提交
  19. 28 6月, 2012 1 次提交
  20. 16 4月, 2012 1 次提交
    • J
      net: add generic PF_BRIDGE:RTM_ FDB hooks · 77162022
      John Fastabend 提交于
      This adds two new flags NTF_MASTER and NTF_SELF that can
      now be used to specify where PF_BRIDGE netlink commands should
      be sent. NTF_MASTER sends the commands to the 'dev->master'
      device for parsing. Typically this will be the linux net/bridge,
      or open-vswitch devices. Also without any flags set the command
      will be handled by the master device as well so that current user
      space tools continue to work as expected.
      
      The NTF_SELF flag will push the PF_BRIDGE commands to the
      device. In the basic example below the commands are then parsed
      and programmed in the embedded bridge.
      
      Note if both NTF_SELF and NTF_MASTER bits are set then the
      command will be sent to both 'dev->master' and 'dev' this allows
      user space to easily keep the embedded bridge and software bridge
      in sync.
      
      There is a slight complication in the case with both flags set
      when an error occurs. To resolve this the rtnl handler clears
      the NTF_ flag in the netlink ack to indicate which sets completed
      successfully. The add/del handlers will abort as soon as any
      error occurs.
      
      To support this new net device ops were added to call into
      the device and the existing bridging code was refactored
      to use these. There should be no required changes in user space
      to support the current bridge behavior.
      
      A basic setup with a SR-IOV enabled NIC looks like this,
      
                veth0  veth2
                  |      |
                ------------
                |  bridge0 |   <---- software bridging
                ------------
                     /
                     /
        ethx.y      ethx
          VF         PF
           \         \          <---- propagate FDB entries to HW
           \         \
        --------------------
        |  Embedded Bridge |    <---- hardware offloaded switching
        --------------------
      
      In this case the embedded bridge must be managed to allow 'veth0'
      to communicate with 'ethx.y' correctly. At present drivers managing
      the embedded bridge either send frames onto the network which
      then get dropped by the switch OR the embedded bridge will flood
      these frames. With this patch we have a mechanism to manage the
      embedded bridge correctly from user space. This example is specific
      to SR-IOV but replacing the VF with another PF or dropping this
      into the DSA framework generates similar management issues.
      
      Examples session using the 'br'[1] tool to add, dump and then
      delete a mac address with a new "embedded" option and enabled
      ixgbe driver:
      
      # br fdb add 22:35:19:ac:60:59 dev eth3
      # br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      #br fdb add 22:35:19:ac:60:59 embedded dev eth3
      #br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      eth3    22:35:19:ac:60:59       local embedded
      #br fdb del 22:35:19:ac:60:59 embedded dev eth3
      
      I added a couple lines to 'br' to set the flags correctly is all. It
      is my opinion that the merit of this patch is now embedded and SW
      bridges can both be modeled correctly in user space using very nearly
      the same message passing.
      
      [1] 'br' tool was published as an RFC here and will be renamed 'bridge'
          http://patchwork.ozlabs.org/patch/117664/
      
      Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
      valuable feedback, suggestions, and review.
      
      v2: fixed api descriptions and error case with both NTF_SELF and
          NTF_MASTER set plus updated patch description.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77162022
  21. 22 2月, 2012 1 次提交
    • G
      rtnetlink: Fix problem with buffer allocation · 115c9b81
      Greg Rose 提交于
      Implement a new netlink attribute type IFLA_EXT_MASK.  The mask
      is a 32 bit value that can be used to indicate to the kernel that
      certain extended ifinfo values are requested by the user application.
      At this time the only mask value defined is RTEXT_FILTER_VF to
      indicate that the user wants the ifinfo dump to send information
      about the VFs belonging to the interface.
      
      This patch fixes a bug in which certain applications do not have
      large enough buffers to accommodate the extra information returned
      by the kernel with large numbers of SR-IOV virtual functions.
      Those applications will not send the new netlink attribute with
      the interface info dump request netlink messages so they will
      not get unexpectedly large request buffers returned by the kernel.
      
      Modifies the rtnl_calcit function to traverse the list of net
      devices and compute the minimum buffer size that can hold the
      info dumps of all matching devices based upon the filter passed
      in via the new netlink attribute filter mask.  If no filter
      mask is sent then the buffer allocation defaults to NLMSG_GOODSIZE.
      
      With this change it is possible to add yet to be defined netlink
      attributes to the dump request which should make it fairly extensible
      in the future.
      Signed-off-by: NGreg Rose <gregory.v.rose@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      115c9b81
  22. 09 7月, 2011 1 次提交
  23. 22 6月, 2011 1 次提交
    • J
      net: dcbnl, add multicast group for DCB · 314b4778
      John Fastabend 提交于
      Now that dcbnl is being used in many cases by more
      than a single agent it is beneficial to be notified
      when some entity either driver or user space has
      changed the DCB attributes.
      
      Today applications either end up polling the interface
      or relying on a user space database to maintain the DCB
      state and post events. Polling is a poor solution for
      obvious reasons. And relying on a user space database
      has its own downside. Namely it has created strange
      boot dependencies requiring the database be populated
      before any applications dependent on DCB attributes
      starts or the application goes into a polling loop.
      Populating the database requires negotiating link
      setting with the peer and can take anywhere from less
      than a second up to a few seconds depending on the switch
      implementation.
      
      Perhaps more importantly if another application or an
      embedded agent sets a DCB link attribute the database
      has no way of knowing other than polling the kernel.
      This prevents applications from responding quickly to
      changes in link events which at least in the FCoE case
      and probably any other protocols expecting a lossless
      link may result in IO errors.
      
      By adding a multicast group for DCB we have clean way
      to disseminate kernel DCB link attributes up to user
      space. Avoiding the need for user space to maintain
      a coherant database and disperse events that potentially
      do not reflect the current link state.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      314b4778
  24. 16 11月, 2010 1 次提交
    • A
      net: rtnetlink.h -- only include linux/netdevice.h when used by the kernel · 3b42a96d
      Andy Whitcroft 提交于
      The commit below added a new helper dev_ingress_queue to cleanly obtain the
      ingress queue pointer.  This necessitated including 'linux/netdevice.h':
      
        commit 24824a09
        Author: Eric Dumazet <eric.dumazet@gmail.com>
        Date:   Sat Oct 2 06:11:55 2010 +0000
      
          net: dynamic ingress_queue allocation
      
      However this include triggers issues for applications in userspace
      which use the rtnetlink interfaces.  Commonly this requires they include
      'net/if.h' and 'linux/rtnetlink.h' leading to a compiler error as below:
      
        In file included from /usr/include/linux/netdevice.h:28:0,
                         from /usr/include/linux/rtnetlink.h:9,
                         from t.c:2:
        /usr/include/linux/if.h:135:8: error: redefinition of ‘struct ifmap’
        /usr/include/net/if.h:112:8: note: originally defined here
        /usr/include/linux/if.h:169:8: error: redefinition of ‘struct ifreq’
        /usr/include/net/if.h:127:8: note: originally defined here
        /usr/include/linux/if.h:218:8: error: redefinition of ‘struct ifconf’
        /usr/include/net/if.h:177:8: note: originally defined here
      
      The new helper is only defined for the kernel and protected by __KERNEL__
      therefore we can simply pull the include down into the same protected
      section.
      Signed-off-by: NAndy Whitcroft <apw@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b42a96d
  25. 05 10月, 2010 2 次提交
  26. 16 9月, 2010 1 次提交
  27. 09 9月, 2010 1 次提交
  28. 23 7月, 2010 1 次提交
  29. 11 5月, 2010 1 次提交
    • P
      ipv6: ip6mr: support multiple tables · d1db275d
      Patrick McHardy 提交于
      This patch adds support for multiple independant multicast routing instances,
      named "tables".
      
      Userspace multicast routing daemons can bind to a specific table instance by
      issuing a setsockopt call using a new option MRT6_TABLE. The table number is
      stored in the raw socket data and affects all following ip6mr setsockopt(),
      getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
      is created with a default routing rule pointing to it. Newly created pim6reg
      devices have the table number appended ("pim6regX"), with the exception of
      devices created in the default table, which are named just "pim6reg" for
      compatibility reasons.
      
      Packets are directed to a specific table instance using routing rules,
      similar to how regular routing rules work. Currently iif, oif and mark
      are supported as keys, source and destination addresses could be supported
      additionally.
      
      Example usage:
      
      - bind pimd/xorp/... to a specific table:
      
      uint32_t table = 123;
      setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));
      
      - create routing rules directing packets to the new table:
      
      # ip -6 mrule add iif eth0 lookup 123
      # ip -6 mrule add oif eth0 lookup 123
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      d1db275d
  30. 26 4月, 2010 1 次提交
    • P
      net: rtnetlink: decouple rtnetlink address families from real address families · 25239cee
      Patrick McHardy 提交于
      Decouple rtnetlink address families from real address families in socket.h to
      be able to add rtnetlink interfaces to code that is not a real address family
      without increasing AF_MAX/NPROTO.
      
      This will be used to add support for multicast route dumping from all tables
      as the proc interface can't be extended to support anything but the main table
      without breaking compatibility.
      
      This partialy undoes the patch to introduce independant families for routing
      rules and converts ipmr routing rules to a new rtnetlink family. Similar to
      that patch, values up to 127 are reserved for real address families, values
      above that may be used arbitrarily.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      25239cee
  31. 25 2月, 2010 1 次提交
    • P
      net: Add checking to rcu_dereference() primitives · a898def2
      Paul E. McKenney 提交于
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a898def2
  32. 24 12月, 2009 1 次提交
    • L
      net: Add rtnetlink init_rcvwnd to set the TCP initial receive window · 31d12926
      laurent chavey 提交于
      Add rtnetlink init_rcvwnd to set the TCP initial receive window size
      advertised by passive and active TCP connections.
      The current Linux TCP implementation limits the advertised TCP initial
      receive window to the one prescribed by slow start. For short lived
      TCP connections used for transaction type of traffic (i.e. http
      requests), bounding the advertised TCP initial receive window results
      in increased latency to complete the transaction.
      Support for setting initial congestion window is already supported
      using rtnetlink init_cwnd, but the feature is useless without the
      ability to set a larger TCP initial receive window.
      The rtnetlink init_rcvwnd allows increasing the TCP initial receive
      window, allowing TCP connection to advertise larger TCP receive window
      than the ones bounded by slow start.
      Signed-off-by: NLaurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31d12926
  33. 16 12月, 2009 1 次提交
    • D
      tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. · bb5b7c11
      David S. Miller 提交于
      It creates a regression, triggering badness for SYN_RECV
      sockets, for example:
      
      [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
      [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
      [19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
      [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
      [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
      
      This is likely caused by the change in the 'estab' parameter
      passed to tcp_parse_options() when invoked by the functions
      in net/ipv4/tcp_minisocks.c
      
      But even if that is fixed, the ->conn_request() changes made in
      this patch series is fundamentally wrong.  They try to use the
      listening socket's 'dst' to probe the route settings.  The
      listening socket doesn't even have a route, and you can't
      get the right route (the child request one) until much later
      after we setup all of the state, and it must be done by hand.
      
      This stuff really isn't ready, so the best thing to do is a
      full revert.  This reverts the following commits:
      
      f55017a9
      022c3f7d
      1aba721e
      cda42ebd
      345cda2f
      dc343475
      05eaade2
      6a2a2d6bSigned-off-by: NDavid S. Miller <davem@davemloft.net>
      bb5b7c11
  34. 05 11月, 2009 1 次提交
  35. 29 10月, 2009 2 次提交