1. 30 12月, 2012 1 次提交
  2. 29 12月, 2012 1 次提交
  3. 22 12月, 2012 1 次提交
  4. 09 12月, 2012 1 次提交
  5. 08 12月, 2012 1 次提交
    • E
      net: gro: fix possible panic in skb_gro_receive() · c3c7c254
      Eric Dumazet 提交于
      commit 2e71a6f8 (net: gro: selective flush of packets) added
      a bug for skbs using frag_list. This part of the GRO stack is rarely
      used, as it needs skb not using a page fragment for their skb->head.
      
      Most drivers do use a page fragment, but some of them use GFP_KERNEL
      allocations for the initial fill of their RX ring buffer.
      
      napi_gro_flush() overwrite skb->prev that was used for these skb to
      point to the last skb in frag_list.
      
      Fix this using a separate field in struct napi_gro_cb to point to the
      last fragment.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3c7c254
  6. 30 11月, 2012 1 次提交
  7. 27 11月, 2012 1 次提交
    • B
      sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name · c91f6df2
      Brian Haley 提交于
      Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
      will then require another call like if_indextoname() to get the actual interface
      name, have it return the name directly.
      
      This also matches the existing man page description on socket(7) which mentions
      the argument being an interface name.
      
      If the value has not been set, zero is returned and optlen will be set to zero
      to indicate there is no interface name present.
      
      Added a seqlock to protect this code path, and dev_ifname(), from someone
      changing the device name via dev_change_name().
      
      v2: Added seqlock protection while copying device name.
      
      v3: Fixed word wrap in patch.
      Signed-off-by: NBrian Haley <brian.haley@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c91f6df2
  8. 19 11月, 2012 1 次提交
  9. 16 11月, 2012 3 次提交
  10. 01 11月, 2012 1 次提交
    • J
      net: create generic bridge ops · e5a55a89
      John Fastabend 提交于
      The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are
      currently embedded in the ./net/bridge module. This prohibits
      them from being used by other bridging devices. One example
      of this being hardware that has embedded bridging components.
      
      In order to use these nlmsg types more generically this patch
      adds two net_device_ops hooks. One to set link bridge attributes
      and another to dump the current bride attributes.
      
      	ndo_bridge_setlink()
      	ndo_bridge_getlink()
      
      CC: Lennert Buytenhek <buytenh@wantstofly.org>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5a55a89
  11. 13 10月, 2012 1 次提交
  12. 09 10月, 2012 2 次提交
    • E
      ipv6: gro: fix PV6_GRO_CB(skb)->proto problem · 86347245
      Eric Dumazet 提交于
      It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive()
      if a new skb is allocated (to serve as an anchor for frag_list)
      
      We copy NAPI_GRO_CB() only (not the IPV6 specific part) in :
      
      *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);
      
      So we leave IPV6_GRO_CB(nskb)->proto to 0 (fresh skb allocation) instead
      of IPPROTO_TCP (6)
      
      ipv6_gro_complete() isnt able to call ops->gro_complete()
      [ tcp6_gro_complete() ]
      
      Fix this by moving proto in NAPI_GRO_CB() and getting rid of
      IPV6_GRO_CB
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86347245
    • E
      net: gro: selective flush of packets · 2e71a6f8
      Eric Dumazet 提交于
      Current GRO can hold packets in gro_list for almost unlimited
      time, in case napi->poll() handler consumes its budget over and over.
      
      In this case, napi_complete()/napi_gro_flush() are not called.
      
      Another problem is that gro_list is flushed in non friendly way :
      We scan the list and complete packets in the reverse order.
      (youngest packets first, oldest packets last)
      This defeats priorities that sender could have cooked.
      
      Since GRO currently only store TCP packets, we dont really notice the
      bug because of retransmits, but this behavior can add unexpected
      latencies, particularly on mice flows clamped by elephant flows.
      
      This patch makes sure no packet can stay more than 1 ms in queue, and
      only in stress situations.
      
      It also complete packets in the right order to minimize latencies.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e71a6f8
  13. 08 10月, 2012 1 次提交
  14. 02 10月, 2012 1 次提交
  15. 28 9月, 2012 1 次提交
  16. 20 9月, 2012 2 次提交
  17. 17 9月, 2012 1 次提交
  18. 06 9月, 2012 1 次提交
    • E
      net: qdisc busylock needs lockdep annotations · 23d3b8bf
      Eric Dumazet 提交于
      It seems we need to provide ability for stacked devices
      to use specific lock_class_key for sch->busylock
      
      We could instead default l2tpeth tx_queue_len to 0 (no qdisc), but
      a user might use a qdisc anyway.
      
      (So same fixes are probably needed on non LLTX stacked drivers)
      
      Noticed while stressing L2TPV3 setup :
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.6.0-rc3+ #788 Not tainted
       -------------------------------------------------------
       netperf/4660 is trying to acquire lock:
        (l2tpsock){+.-...}, at: [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
      
       but task is already holding lock:
        (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&sch->busylock)->rlock){+.-...}:
              [<ffffffff810a5df0>] lock_acquire+0x90/0x200
              [<ffffffff817499fc>] _raw_spin_lock_irqsave+0x4c/0x60
              [<ffffffff81074872>] __wake_up+0x32/0x70
              [<ffffffff8136d39e>] tty_wakeup+0x3e/0x80
              [<ffffffff81378fb3>] pty_write+0x73/0x80
              [<ffffffff8136cb4c>] tty_put_char+0x3c/0x40
              [<ffffffff813722b2>] process_echoes+0x142/0x330
              [<ffffffff813742ab>] n_tty_receive_buf+0x8fb/0x1230
              [<ffffffff813777b2>] flush_to_ldisc+0x142/0x1c0
              [<ffffffff81062818>] process_one_work+0x198/0x760
              [<ffffffff81063236>] worker_thread+0x186/0x4b0
              [<ffffffff810694d3>] kthread+0x93/0xa0
              [<ffffffff81753e24>] kernel_thread_helper+0x4/0x10
      
       -> #0 (l2tpsock){+.-...}:
              [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
              [<ffffffff810a5df0>] lock_acquire+0x90/0x200
              [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
              [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
              [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
              [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
              [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
              [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
              [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
              [<ffffffff815db019>] ip_output+0x59/0xf0
              [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
              [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
              [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
              [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
              [<ffffffff815f5300>] tcp_push_one+0x30/0x40
              [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
              [<ffffffff81614495>] inet_sendmsg+0x125/0x230
              [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
              [<ffffffff81579ece>] sys_sendto+0xfe/0x130
              [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&sch->busylock)->rlock);
                                      lock(l2tpsock);
                                      lock(&(&sch->busylock)->rlock);
         lock(l2tpsock);
      
        *** DEADLOCK ***
      
       5 locks held by netperf/4660:
        #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e581c>] tcp_sendmsg+0x2c/0x1040
        #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff815da3e0>] ip_queue_xmit+0x0/0x680
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff815d9ac5>] ip_finish_output+0x135/0x890
        #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81595820>] dev_queue_xmit+0x0/0xe00
        #4:  (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00
      
       stack backtrace:
       Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
       Call Trace:
        [<ffffffff8173dbf8>] print_circular_bug+0x1fb/0x20c
        [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
        [<ffffffff810a334b>] ? check_usage+0x9b/0x4d0
        [<ffffffff810a3f44>] ? __lock_acquire+0x2e4/0x1b10
        [<ffffffff810a5df0>] lock_acquire+0x90/0x200
        [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
        [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
        [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
        [<ffffffff81594e0e>] ? dev_hard_start_xmit+0x5e/0xa70
        [<ffffffff81595961>] ? dev_queue_xmit+0x141/0xe00
        [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
        [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
        [<ffffffff81595820>] ? dev_hard_start_xmit+0xa70/0xa70
        [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
        [<ffffffff815d9ac5>] ? ip_finish_output+0x135/0x890
        [<ffffffff815db019>] ip_output+0x59/0xf0
        [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
        [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
        [<ffffffff815da3e0>] ? ip_local_out+0xa0/0xa0
        [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
        [<ffffffff815fa25e>] ? tcp_md5_do_lookup+0x18e/0x1a0
        [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
        [<ffffffff815f5300>] tcp_push_one+0x30/0x40
        [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
        [<ffffffff81614495>] inet_sendmsg+0x125/0x230
        [<ffffffff81614370>] ? inet_create+0x6b0/0x6b0
        [<ffffffff8157e6e2>] ? sock_update_classid+0xc2/0x3b0
        [<ffffffff8157e750>] ? sock_update_classid+0x130/0x3b0
        [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
        [<ffffffff81162579>] ? fget_light+0x3f9/0x4f0
        [<ffffffff81579ece>] sys_sendto+0xfe/0x130
        [<ffffffff810a69ad>] ? trace_hardirqs_on+0xd/0x10
        [<ffffffff8174a0b0>] ? _raw_spin_unlock_irq+0x30/0x50
        [<ffffffff810757e3>] ? finish_task_switch+0x83/0xf0
        [<ffffffff810757a6>] ? finish_task_switch+0x46/0xf0
        [<ffffffff81752cb7>] ? sysret_check+0x1b/0x56
        [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23d3b8bf
  19. 25 8月, 2012 1 次提交
    • B
      net: Set device operstate at registration time · 8f4cccbb
      Ben Hutchings 提交于
      The operstate of a device is initially IF_OPER_UNKNOWN and is updated
      asynchronously by linkwatch after each change of carrier state
      reported by the driver.  The default carrier state of a net device is
      on, and this will never be changed on drivers that do not support
      carrier detection, thus the operstate remains IF_OPER_UNKNOWN.
      
      For devices that do support carrier detection, the driver must set the
      carrier state to off initially, then poll the hardware state when the
      device is opened.  However, we must not activate linkwatch for a
      unregistered device, and commit b4730016 ('net: Do not fire linkwatch
      events until the device is registered.') ensured that we don't.  But
      this means that the operstate for many devices that support carrier
      detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.
      
      The same issue exists with the dormant state.
      
      The proper initialisation sequence, avoiding a race with opening of
      the device, is:
      
              rtnl_lock();
              rc = register_netdevice(dev);
              if (rc)
                      goto out_unlock;
              netif_carrier_off(dev); /* or netif_dormant_on(dev) */
              rtnl_unlock();
      
      but it seems silly that this should have to be repeated in so many
      drivers.  Further, the operstate seen immediately after opening the
      device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
      linkwatch.
      
      Commit 22604c86 ('net: Fix for initial link state in 2.6.28') attempted
      to fix this by setting the operstate synchronously, but it was
      reverted as it could lead to deadlock.
      
      This initialises the operstate synchronously at registration time
      only.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f4cccbb
  20. 23 8月, 2012 1 次提交
    • E
      net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet 提交于
      I noticed extra one second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait a little (but one second) in netdev_wait_allrefs() before
      kicking again NETDEV_UNREGISTER.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0115e8e3
  21. 20 8月, 2012 1 次提交
    • E
      af_packet: don't emit packet on orig fanout group · c0de08d0
      Eric Leblond 提交于
      If a packet is emitted on one socket in one group of fanout sockets,
      it is transmitted again. It is thus read again on one of the sockets
      of the fanout group. This result in a loop for software which
      generate packets when receiving one.
      This retransmission is not the intended behavior: a fanout group
      must behave like a single socket. The packet should not be
      transmitted on a socket if it originates from a socket belonging
      to the same fanout group.
      
      This patch fixes the issue by changing the transmission check to
      take fanout group info account.
      Reported-by: NAleksandr Kotov <a1k@mail.ru>
      Signed-off-by: NEric Leblond <eric@regit.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0de08d0
  22. 15 8月, 2012 3 次提交
  23. 02 8月, 2012 1 次提交
    • B
      net: Allow driver to limit number of GSO segments per skb · 30b678d8
      Ben Hutchings 提交于
      A peer (or local user) may cause TCP to use a nominal MSS of as little
      as 88 (actual MSS of 76 with timestamps).  Given that we have a
      sufficiently prodigious local sender and the peer ACKs quickly enough,
      it is nevertheless possible to grow the window for such a connection
      to the point that we will try to send just under 64K at once.  This
      results in a single skb that expands to 861 segments.
      
      In some drivers with TSO support, such an skb will require hundreds of
      DMA descriptors; a substantial fraction of a TX ring or even more than
      a full ring.  The TX queue selected for the skb may stall and trigger
      the TX watchdog repeatedly (since the problem skb will be retried
      after the TX reset).  This particularly affects sfc, for which the
      issue is designated as CVE-2012-3412.
      
      Therefore:
      1. Add the field net_device::gso_max_segs holding the device-specific
         limit.
      2. In netif_skb_features(), if the number of segments is too high then
         mask out GSO features to force fall back to software GSO.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30b678d8
  24. 21 7月, 2012 1 次提交
  25. 05 7月, 2012 1 次提交
  26. 13 6月, 2012 1 次提交
    • M
      net-next: add dev_loopback_xmit() to avoid duplicate code · 95603e22
      Michel Machado 提交于
      Add dev_loopback_xmit() in order to deduplicate functions
      ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
      ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
      
      I was about to reinvent the wheel when I noticed that
      ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
      need and are not IP-only functions, but they were not available to reuse
      elsewhere.
      
      ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
      understand that this is harmless, and should be in dev_loopback_xmit().
      Signed-off-by: NMichel Machado <michel@digirati.com.br>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95603e22
  27. 12 6月, 2012 1 次提交
  28. 31 5月, 2012 1 次提交
  29. 19 5月, 2012 1 次提交
  30. 18 5月, 2012 1 次提交
  31. 11 5月, 2012 1 次提交
  32. 01 5月, 2012 1 次提交
    • E
      net: make GRO aware of skb->head_frag · d7e8883c
      Eric Dumazet 提交于
      GRO can check if skb to be merged has its skb->head mapped to a page
      fragment, instead of a kmalloc() area.
      
      We 'upgrade' skb->head as a fragment in itself
      
      This avoids the frag_list fallback, and permits to build true GRO skb
      (one sk_buff and up to 16 fragments), using less memory.
      
      This reduces number of cache misses when user makes its copy, since a
      single sk_buff is fetched.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7e8883c
  33. 16 4月, 2012 2 次提交
    • J
      net: addr_list: add exclusive dev_uc_add and dev_mc_add · 12a94634
      John Fastabend 提交于
      This adds a dev_uc_add_excl() and dev_mc_add_excl() calls
      similar to the original dev_{uc|mc}_add() except it sets
      the global bit and returns -EEXIST for duplicat entires.
      
      This is useful for drivers that support SR-IOV, macvlan
      devices and any other devices that need to manage the
      unicast and multicast lists.
      
      v2: fix typo UNICAST should be MULTICAST in dev_mc_add_excl()
      
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12a94634
    • J
      net: add generic PF_BRIDGE:RTM_ FDB hooks · 77162022
      John Fastabend 提交于
      This adds two new flags NTF_MASTER and NTF_SELF that can
      now be used to specify where PF_BRIDGE netlink commands should
      be sent. NTF_MASTER sends the commands to the 'dev->master'
      device for parsing. Typically this will be the linux net/bridge,
      or open-vswitch devices. Also without any flags set the command
      will be handled by the master device as well so that current user
      space tools continue to work as expected.
      
      The NTF_SELF flag will push the PF_BRIDGE commands to the
      device. In the basic example below the commands are then parsed
      and programmed in the embedded bridge.
      
      Note if both NTF_SELF and NTF_MASTER bits are set then the
      command will be sent to both 'dev->master' and 'dev' this allows
      user space to easily keep the embedded bridge and software bridge
      in sync.
      
      There is a slight complication in the case with both flags set
      when an error occurs. To resolve this the rtnl handler clears
      the NTF_ flag in the netlink ack to indicate which sets completed
      successfully. The add/del handlers will abort as soon as any
      error occurs.
      
      To support this new net device ops were added to call into
      the device and the existing bridging code was refactored
      to use these. There should be no required changes in user space
      to support the current bridge behavior.
      
      A basic setup with a SR-IOV enabled NIC looks like this,
      
                veth0  veth2
                  |      |
                ------------
                |  bridge0 |   <---- software bridging
                ------------
                     /
                     /
        ethx.y      ethx
          VF         PF
           \         \          <---- propagate FDB entries to HW
           \         \
        --------------------
        |  Embedded Bridge |    <---- hardware offloaded switching
        --------------------
      
      In this case the embedded bridge must be managed to allow 'veth0'
      to communicate with 'ethx.y' correctly. At present drivers managing
      the embedded bridge either send frames onto the network which
      then get dropped by the switch OR the embedded bridge will flood
      these frames. With this patch we have a mechanism to manage the
      embedded bridge correctly from user space. This example is specific
      to SR-IOV but replacing the VF with another PF or dropping this
      into the DSA framework generates similar management issues.
      
      Examples session using the 'br'[1] tool to add, dump and then
      delete a mac address with a new "embedded" option and enabled
      ixgbe driver:
      
      # br fdb add 22:35:19:ac:60:59 dev eth3
      # br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      #br fdb add 22:35:19:ac:60:59 embedded dev eth3
      #br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      eth3    22:35:19:ac:60:59       local embedded
      #br fdb del 22:35:19:ac:60:59 embedded dev eth3
      
      I added a couple lines to 'br' to set the flags correctly is all. It
      is my opinion that the merit of this patch is now embedded and SW
      bridges can both be modeled correctly in user space using very nearly
      the same message passing.
      
      [1] 'br' tool was published as an RFC here and will be renamed 'bridge'
          http://patchwork.ozlabs.org/patch/117664/
      
      Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
      valuable feedback, suggestions, and review.
      
      v2: fixed api descriptions and error case with both NTF_SELF and
          NTF_MASTER set plus updated patch description.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77162022