1. 24 Jun 2022, 1 commit
  2. 01 Jun 2022, 1 commit
  3. 12 Mar 2022, 1 commit
  4. 21 Feb 2022, 2 commits
  5. 27 Jan 2022, 1 commit
  6. 24 Jan 2022, 1 commit
  7. 30 Nov 2021, 1 commit
  8. 12 Aug 2021, 1 commit
  9. 10 Aug 2021, 1 commit
    • net, bonding: Add XDP support to the bonding driver · 9e2ee5c7
      Jussi Maki authored
      XDP is implemented in the bonding driver by transparently delegating
      the XDP program loading, removal and xmit operations to the bonding
      slave devices. The overall goal of this work is that XDP programs
      can be attached to a bond device *without* any further changes (or
      awareness) necessary to the program itself, meaning the same XDP
      program can be attached to a native device but also a bonding device.
      
      Semantics of XDP_TX when attached to a bond are, in such a setting,
      equivalent to the case when a tc/BPF program would be attached to the
      bond, meaning transmitting the packet out of the bond itself using one
      of the bond's configured xmit methods to select a slave device (rather
      than XDP_TX on the slave itself). Handling of XDP_TX to transmit
      using the configured bonding mechanism is therefore implemented by
      rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
      performance impact this check is guarded by a static key, which is
      incremented when an XDP program is loaded onto a bond device. This
      approach was chosen to avoid changes to drivers implementing XDP. If
      the slave device does not match the receive device, then XDP_REDIRECT
      is transparently used to perform the redirection in order to have
      the network driver release the packet from its RX ring. The bonding
      driver hashing functions have been refactored to allow reuse with
      xdp_buff's to avoid code duplication.
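      As a rough, illustrative sketch of that static-key-guarded return-value
      rewrite (the key and helper names below are assumptions, not the exact
      upstream code):

          #include <linux/filter.h>
          #include <linux/jump_label.h>
          #include <linux/netdevice.h>

          /* Assumed name; a similar key is bumped per XDP program attached to a bond. */
          DEFINE_STATIC_KEY_FALSE(xdp_bond_redirect_key);

          static __always_inline u32 run_xdp_on_possible_bond_slave(const struct bpf_prog *prog,
                                                                     struct xdp_buff *xdp)
          {
                  u32 act = bpf_prog_run(prog, xdp);

                  /* Only pay for the extra branch when some bond has an XDP program. */
                  if (static_branch_unlikely(&xdp_bond_redirect_key) &&
                      act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev))
                          act = XDP_REDIRECT; /* let the bond pick the egress slave */

                  return act;
          }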
      
      The motivation for this change is to enable use of bonding (and
      802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
      XDP and also to transparently support bond devices for projects that
      use XDP given most modern NICs have dual port adapters. An alternative
      to this approach would be to implement 802.3ad in user-space and
      implement the bonding load-balancing in the XDP program itself, but
      this is rather a cumbersome endeavor in terms of slave device management
      (e.g. by watching netlink) and requires separate programs for native
      vs bond cases for the orchestrator. A native in-kernel implementation
      overcomes these issues and provides more flexibility.
      
      Below are benchmark results done on two machines with 100Gbit
      Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
      16-core 3950X on receiving machine. 64 byte packets were sent with
      pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
      ice driver, so the tests were performed with iommu=off and patch [2]
      applied. Additionally the bonding round robin algorithm was modified
      to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
      of cache misses were caused by the shared rr_tx_counter (see patch
      2/3). The statistics were collected using "sar -n dev -u 1 10". On top
      of that, for ice, further work is in progress on improving the XDP_TX
      numbers [4].
      
       -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
       without patch (1 dev):
         XDP_DROP:              3.15%      48.6Mpps
         XDP_TX:                3.12%      18.3Mpps     18.3Mpps
         XDP_DROP (RSS):        9.47%      116.5Mpps
         XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
       -----------------------
       with patch, bond (1 dev):
         XDP_DROP:              3.14%      46.7Mpps
         XDP_TX:                3.15%      13.9Mpps     13.9Mpps
         XDP_DROP (RSS):        10.33%     117.2Mpps
         XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
       -----------------------
       with patch, bond (2 devs):
         XDP_DROP:              6.27%      92.7Mpps
         XDP_TX:                6.26%      17.6Mpps     17.5Mpps
         XDP_DROP (RSS):       11.38%      117.2Mpps
         XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
       --------------------------------------------------------------
      
      RSS: Receive Side Scaling, i.e. the packets were sent to a range of
      destination IPs.
      
        [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
        [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
        [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
        [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
      9e2ee5c7
  10. 03 Aug 2021, 1 commit
    • bonding: add new option lacp_active · 3a755cd8
      Hangbin Liu authored
      Add an option lacp_active, which is similar to team's runner.active.
      This option specifies whether to send LACPDU frames periodically. If set
      on, the LACPDU frames are sent along with the configured lacp_rate
      setting. If set off, the LACPDU frames act as "speak when spoken to".
      
      Note that LACPDU state frames will still be sent when a port is initialized or unbound.
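      As a loose sketch of the described behaviour (the field and helper names
      below are assumptions, not the upstream 802.3ad state machine):

          #include <linux/types.h>

          struct lacp_port_cfg {
                  bool lacp_active;       /* on: send LACPDUs periodically (default) */
                  bool partner_spoke;     /* an LACPDU was received from the partner */
          };

          static bool should_run_periodic_lacpdu_tx(const struct lacp_port_cfg *cfg)
          {
                  /* "speak when spoken to": with lacp_active off, only respond to
                   * a partner that has already sent us an LACPDU.
                   */
                  return cfg->lacp_active || cfg->partner_spoke;
          }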
      
      v2: remove module parameter
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a755cd8
  11. 07 Jul 2021, 1 commit
    • bonding: Add struct bond_ipesc to manage SA · 9a560550
      Taehee Yoo authored
      bonding has supported ipsec offload for a while.
      When an SA is added, bonding just passes the SA to its active real
      interface, but it doesn't keep track of it.
      So, when events occur (add/del real interface, active real interface
      change, etc.), bonding can't handle them well because it doesn't manage
      the SA, and problems (panic, UAF, refcnt leak) result.
      
      In order to make it stable, it should manage SA.
      That's the reason why struct bond_ipsec is added.
      When a new SA is added to the bonding interface, it is stored in the
      bond_ipsec list, and the SA is passed to the current active real interface.
      If events occur, it uses bond_ipsec data to handle these events.
      bond->ipsec_list is protected by bond->ipsec_lock.
      
      If the current active real interface changes, the following logic applies:
      1. Delete all SAs from the old active real interface.
      2. Add all SAs to the new active real interface.
      3. If the new active real interface doesn't support ipsec offload or the
      SA's options, set real_dev to NULL.
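      A simplified sketch of that bookkeeping (the struct and list/lock follow
      the description above; helper details are assumptions):

          #include <linux/errno.h>
          #include <linux/list.h>
          #include <linux/slab.h>
          #include <linux/spinlock.h>

          struct xfrm_state;                      /* opaque here */

          struct bond_ipsec {
                  struct list_head list;
                  struct xfrm_state *xs;
          };

          struct bond_ipsec_ctx {
                  struct list_head ipsec_list;    /* every SA added on the bond */
                  spinlock_t ipsec_lock;          /* protects ipsec_list */
          };

          static int bond_ipsec_track(struct bond_ipsec_ctx *ctx, struct xfrm_state *xs)
          {
                  struct bond_ipsec *ipsec = kzalloc(sizeof(*ipsec), GFP_ATOMIC);

                  if (!ipsec)
                          return -ENOMEM;
                  ipsec->xs = xs;
                  spin_lock_bh(&ctx->ipsec_lock);
                  list_add(&ipsec->list, &ctx->ipsec_list);
                  spin_unlock_bh(&ctx->ipsec_lock);
                  return 0;
          }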
      
      Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a560550
  12. 16 Jun 2021, 1 commit
    • net: bonding: Use per-cpu rr_tx_counter · 848ca918
      Jussi Maki authored
      The round-robin rr_tx_counter was shared across CPUs leading to
      significant cache thrashing at high packet rates. This patch switches
      the round-robin packet counter to use a per-cpu variable to decide
      the destination slave.
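      A minimal sketch of the per-cpu counter approach (names are illustrative,
      not the exact bonding fields):

          #include <linux/errno.h>
          #include <linux/percpu.h>
          #include <linux/types.h>

          struct rr_state {
                  u32 __percpu *tx_counter;       /* one counter per CPU */
          };

          static int rr_state_init(struct rr_state *s)
          {
                  s->tx_counter = alloc_percpu(u32);
                  return s->tx_counter ? 0 : -ENOMEM;
          }

          static u32 rr_next_slave_id(struct rr_state *s, unsigned int slave_cnt)
          {
                  /* this_cpu_inc_return() avoids the cross-CPU cache-line bouncing
                   * that a single shared counter causes at high packet rates.
                   */
                  return this_cpu_inc_return(*s->tx_counter) % slave_cnt;
          }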
      
      On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh
      (-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after
      this patch.
      
      "perf top -e cache_misses" before:
          12.31%  [bonding]       [k] bond_xmit_roundrobin_slave_get
          10.59%  [sch_fq_codel]  [k] fq_codel_dequeue
           9.34%  [kernel]        [k] skb_release_data
      after:
          15.42%  [sch_fq_codel]  [k] fq_codel_dequeue
          10.06%  [kernel]        [k] __memset
           9.12%  [kernel]        [k] skb_release_data
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      848ca918
  13. 19 Jan 2021, 3 commits
    • net/bonding: Declare TLS RX device offload support · dc5809f9
      Tariq Toukan authored
      Following the description in the previous patch (for TX):
      As the bond interface is being bypassed by the TLS module, interacting
      directly against the lower devs, there is no way for the bond interface
      to disable its device offload capabilities, as long as the mode/policy
      config allows it.
      Hence, the feature flag is not directly controllable, but just reflects
      the offload status based on the logic under bond_sk_check().
      
      Here we just declare RX device offload support, and expose it via the
      NETIF_F_HW_TLS_RX flag.
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dc5809f9
    • net/bonding: Implement TLS TX device offload · 89df6a81
      Tariq Toukan authored
      Implement TLS TX device offload for bonding interfaces.
      This allows kTLS sockets running on a bond to benefit from the
      device offload on capable lower devices.
      
      To allow a simple and fast maintenance of the TLS context in SW and
      lower devices, we bind the TLS socket to a specific lower dev.
      To achieve a behavior similar to SW kTLS, we support only balance-xor
      and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
      in bond_sk_check(), done in a previous patch.
      
      For the above configuration, the SW implementation keeps picking the
      same exact lower dev for all the socket's SKBs. The device offload
      behaves similarly, making the decision once at the connection creation.
      
      Per socket, the TLS module should work directly with the lowest netdev
      in the chain, to call the tls_dev_ops operations.
      
      As the bond interface is being bypassed by the TLS module, interacting
      directly against the lower devs, there is no way for the bond interface
      to disable its device offload capabilities, as long as the mode/policy
      config allows it.
      Hence, the feature flag is not directly controllable, but just reflects
      the current offload status based on the logic under bond_sk_check().
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      89df6a81
    • net/bonding: Implement ndo_sk_get_lower_dev · 007feb87
      Tariq Toukan authored
      Add ndo_sk_get_lower_dev() implementation for bond interfaces.
      
      Support is limited to cases where the socket's and the SKBs' hash
      yield an identical value for the whole connection lifetime.
      
      Here we restrict it to L3+4 sockets only, with
      xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
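      A sketch of that mode/policy gate (the constants are the standard bonding
      UAPI values; the helper name here is made up, the upstream check lives in
      bond_sk_check()):

          #include <linux/if_bonding.h>
          #include <linux/types.h>

          static bool mode_supports_sk_lower_dev(int mode, int xmit_policy)
          {
                  /* Only configurations where the socket hash and the SKB hash
                   * pick the same slave for the whole connection lifetime qualify.
                   */
                  switch (mode) {
                  case BOND_MODE_XOR:
                  case BOND_MODE_8023AD:
                          return xmit_policy == BOND_XMIT_POLICY_LAYER34;
                  default:
                          return false;
                  }
          }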
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      007feb87
  14. 09 Dec 2020, 1 commit
    • bonding: fix feature flag setting at init time · 007ab534
      Jarod Wilson authored
      Don't try to adjust XFRM support flags if the bond device isn't yet
      registered. Bad things can currently happen when netdev_change_features()
      is called without having wanted_features fully filled in yet. This code
      runs both on post-module-load mode changes, as well as at module init
      time, and when run at module init time, it is before register_netdevice()
      has been called and filled in wanted_features. The empty wanted_features
      led to features also getting emptied out, which was definitely not the
      intended behavior, so prevent that from happening.
      
      Originally, I'd hoped to stop adjusting wanted_features at all in the
      bonding driver, as it's documented as being something only the network
      core should touch, but we actually do need to do this to properly update
      both the features and wanted_features fields when changing the bond type,
      or we get to a situation where ethtool sees:
      
          esp-hw-offload: off [requested on]
      
      I do think we should be using netdev_update_features instead of
      netdev_change_features here though, so we only send notifiers when the
      features actually changed.
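      A hedged sketch of the guard being described (the helper name and exact
      flow are assumptions):

          #include <linux/netdevice.h>

          static void bond_xfrm_features_fixup(struct net_device *bond_dev,
                                               netdev_features_t xfrm_features,
                                               bool enable)
          {
                  /* Before register_netdevice() has run, wanted_features is not
                   * populated yet, so touching features here would wipe them out.
                   */
                  if (bond_dev->reg_state != NETREG_REGISTERED)
                          return;

                  if (enable)
                          bond_dev->wanted_features |= xfrm_features;
                  else
                          bond_dev->wanted_features &= ~xfrm_features;

                  netdev_update_features(bond_dev);
          }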
      
      Fixes: a3b658cf ("bonding: allow xfrm offload setup post-module-load")
      Reported-by: Ivan Vecera <ivecera@redhat.com>
      Suggested-by: Ivan Vecera <ivecera@redhat.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20201205172229.576587-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      007ab534
  15. 22 Nov 2020, 1 commit
    • bonding: wait for sysfs kobject destruction before freeing struct slave · b9ad3e9f
      Jamie Iles authored
      syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, releasing a
      struct slave device could result in the following splat:
      
        kobject: 'bonding_slave' (00000000cecdd4fe): kobject_release, parent 0000000074ceb2b2 (delayed 1000)
        bond0 (unregistering): (slave bond_slave_1): Releasing backup interface
        ------------[ cut here ]------------
        ODEBUG: free active (active state 0) object type: timer_list hint: workqueue_select_cpu_near kernel/workqueue.c:1549 [inline]
        ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98 kernel/workqueue.c:1600
        WARNING: CPU: 1 PID: 842 at lib/debugobjects.c:485 debug_print_object+0x180/0x240 lib/debugobjects.c:485
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 842 Comm: kworker/u4:4 Tainted: G S                5.9.0-rc8+ #96
        Hardware name: linux,dummy-virt (DT)
        Workqueue: netns cleanup_net
        Call trace:
         dump_backtrace+0x0/0x4d8 include/linux/bitmap.h:239
         show_stack+0x34/0x48 arch/arm64/kernel/traps.c:142
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x174/0x1f8 lib/dump_stack.c:118
         panic+0x360/0x7a0 kernel/panic.c:231
         __warn+0x244/0x2ec kernel/panic.c:600
         report_bug+0x240/0x398 lib/bug.c:198
         bug_handler+0x50/0xc0 arch/arm64/kernel/traps.c:974
         call_break_hook+0x160/0x1d8 arch/arm64/kernel/debug-monitors.c:322
         brk_handler+0x30/0xc0 arch/arm64/kernel/debug-monitors.c:329
         do_debug_exception+0x184/0x340 arch/arm64/mm/fault.c:864
         el1_dbg+0x48/0xb0 arch/arm64/kernel/entry-common.c:65
         el1_sync_handler+0x170/0x1c8 arch/arm64/kernel/entry-common.c:93
         el1_sync+0x80/0x100 arch/arm64/kernel/entry.S:594
         debug_print_object+0x180/0x240 lib/debugobjects.c:485
         __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
         debug_check_no_obj_freed+0x200/0x430 lib/debugobjects.c:998
         slab_free_hook mm/slub.c:1536 [inline]
         slab_free_freelist_hook+0x190/0x210 mm/slub.c:1577
         slab_free mm/slub.c:3138 [inline]
         kfree+0x13c/0x460 mm/slub.c:4119
         bond_free_slave+0x8c/0xf8 drivers/net/bonding/bond_main.c:1492
         __bond_release_one+0xe0c/0xec8 drivers/net/bonding/bond_main.c:2190
         bond_slave_netdev_event drivers/net/bonding/bond_main.c:3309 [inline]
         bond_netdev_event+0x8f0/0xa70 drivers/net/bonding/bond_main.c:3420
         notifier_call_chain+0xf0/0x200 kernel/notifier.c:83
         __raw_notifier_call_chain kernel/notifier.c:361 [inline]
         raw_notifier_call_chain+0x44/0x58 kernel/notifier.c:368
         call_netdevice_notifiers_info+0xbc/0x150 net/core/dev.c:2033
         call_netdevice_notifiers_extack net/core/dev.c:2045 [inline]
         call_netdevice_notifiers net/core/dev.c:2059 [inline]
         rollback_registered_many+0x6a4/0xec0 net/core/dev.c:9347
         unregister_netdevice_many.part.0+0x2c/0x1c0 net/core/dev.c:10509
         unregister_netdevice_many net/core/dev.c:10508 [inline]
         default_device_exit_batch+0x294/0x338 net/core/dev.c:10992
         ops_exit_list.isra.0+0xec/0x150 net/core/net_namespace.c:189
         cleanup_net+0x44c/0x888 net/core/net_namespace.c:603
         process_one_work+0x96c/0x18c0 kernel/workqueue.c:2269
         worker_thread+0x3f0/0xc30 kernel/workqueue.c:2415
         kthread+0x390/0x498 kernel/kthread.c:292
         ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:925
      
      This is a potential use-after-free if the sysfs nodes are being accessed
      whilst removing the struct slave, so wait for the object destruction to
      complete before freeing the struct slave itself.
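      One way to implement such a wait, sketched with a completion (this only
      mirrors the idea in the commit message; the exact upstream mechanics may
      differ, and all names are illustrative):

          #include <linux/completion.h>
          #include <linux/kobject.h>
          #include <linux/slab.h>

          struct slave_like {
                  struct kobject kobj;
                  struct completion kobj_released;    /* fired from ->release() */
          };

          static void slave_like_kobj_release(struct kobject *kobj)
          {
                  struct slave_like *s = container_of(kobj, struct slave_like, kobj);

                  complete(&s->kobj_released);
          }

          static void slave_like_free(struct slave_like *s)
          {
                  kobject_put(&s->kobj);
                  /* Don't kfree() until sysfs has really dropped its last reference,
                   * even with CONFIG_DEBUG_KOBJECT_RELEASE delaying the release.
                   */
                  wait_for_completion(&s->kobj_released);
                  kfree(s);
          }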
      
      Fixes: 07699f9a ("bonding: add sysfs /slave dir for bond slave devices.")
      Fixes: a068aab4 ("bonding: Fix reference count leak in bond_sysfs_slave_add.")
      Cc: Qiushi Wu <wu000273@umn.edu>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Jamie Iles <jamie@nuviainc.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/20201120142827.879226-1-jamie@nuviainc.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b9ad3e9f
  16. 02 Jul 2020, 1 commit
    • bonding: allow xfrm offload setup post-module-load · a3b658cf
      Jarod Wilson authored
      At the moment, bonding xfrm crypto offload can only be set up if the bonding
      module is loaded with active-backup mode already set. We need to be able to
      make this work with bonds set to AB after the bonding driver has already
      been loaded.
      
      So what's done here is:
      
      1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used
      by both bond_main.c and bond_options.c
      2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than
      only when loading in AB mode
      3) wire up xfrmdev_ops universally too
      4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB
      5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a
      performance hit from traversing into the underlying drivers
      6) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call
      netdev_change_features() from bond_option_mode_set()
      
      In my local testing, I can change bonding modes back and forth on the fly,
      have hardware offload work when I'm in AB, and see no performance penalty
      to non-AB software encryption, despite having xfrm bits all wired up for
      all modes now.
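      A sketch of the feature toggling in steps 4-6 (BOND_XFRM_FEATURES is the
      macro named above; the exact flag set and the helper shown here are
      assumptions):

          #include <linux/netdevice.h>

          #define BOND_XFRM_FEATURES (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | \
                                      NETIF_F_GSO_ESP)

          static void bond_toggle_xfrm_features(struct net_device *bond_dev,
                                                bool active_backup)
          {
                  if (active_backup)
                          bond_dev->wanted_features |= BOND_XFRM_FEATURES;
                  else
                          bond_dev->wanted_features &= ~BOND_XFRM_FEATURES;

                  netdev_change_features(bond_dev);
          }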
      
      Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
      Reported-by: Huy Nguyen <huyn@mellanox.com>
      CC: Saeed Mahameed <saeedm@mellanox.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: Jakub Kicinski <kuba@kernel.org>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: netdev@vger.kernel.org
      CC: intel-wired-lan@lists.osuosl.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3b658cf
  17. 23 Jun 2020, 1 commit
    • bonding: support hardware encryption offload to slaves · 18cb261a
      Jarod Wilson authored
      Currently, this support is limited to active-backup mode, as I'm not sure
      about the feasibility of mapping an xfrm_state's offload handle to
      multiple hardware devices simultaneously, and we rely on being able to
      pass some hints to both the xfrm and NIC driver about whether or not
      they're operating on a slave device.
      
      I've tested this atop an Intel x520 device (ixgbe) using libreswan in
      transport mode, successfully achieving ~4.3Gbps throughput with netperf
      (more or less identical to throughput on a bare NIC in this system),
      as well as successful failover and recovery mid-netperf.
      
      v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: Jakub Kicinski <kuba@kernel.org>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: netdev@vger.kernel.org
      CC: intel-wired-lan@lists.osuosl.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18cb261a
  18. 08 May 2020, 2 commits
  19. 05 May 2020, 1 commit
  20. 02 May 2020, 2 commits
  21. 29 Feb 2020, 1 commit
  22. 06 Nov 2019, 1 commit
    • bonding: fix state transition issue in link monitoring · 1899bb32
      Jay Vosburgh authored
      Since de77ecd4 ("bonding: improve link-status update in
      mii-monitoring"), the bonding driver has utilized two separate variables
      to indicate the next link state a particular slave should transition to.
      Each is used to communicate to a different portion of the link state
      change commit logic; one to the bond_miimon_commit function itself, and
      another to the state transition logic.
      
      	Unfortunately, the two variables can become unsynchronized,
      resulting in incorrect link state transitions within bonding.  This can
      cause slaves to become stuck in an incorrect link state until a
      subsequent carrier state transition.
      
      	The issue occurs when a special case in bond_slave_netdev_event
      sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
      bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
      case will set the proposed next state (link_new_state) to BOND_LINK_UP,
      but the new_link to BOND_LINK_DOWN.  The setting of the final link state
      from new_link comes after that from link_new_state, and so the slave
      will end up incorrectly in _DOWN state.
      
      	Resolve this by combining the two variables into one.
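      The single-variable pattern, sketched (this mirrors the propose/commit
      helpers described above with simplified, assumed names):

          struct slave_link {
                  int link;               /* committed: BOND_LINK_UP/DOWN/FAIL/BACK */
                  int link_new_state;     /* the one proposed next state */
          };

          static void propose_link_state(struct slave_link *s, int state)
          {
                  s->link_new_state = state;
          }

          static void commit_link_state(struct slave_link *s)
          {
                  s->link = s->link_new_state;
          }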
      Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
      Reported-by: Sha Zhang <zhangsha.zhang@huawei.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Fixes: de77ecd4 ("bonding: improve link-status update in mii-monitoring")
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1899bb32
  23. 25 Oct 2019, 2 commits
    • net: remove unnecessary variables and callback · f3b0a18b
      Taehee Yoo authored
      This patch removes the variables and the callback that are related to the
      nested device structure.
      Devices that can be nested have their own nest_level variable, which
      represents the depth of nested devices.
      In the previous patch, new {lower/upper}_level variables were added to
      replace the old private nest_level variable, so this patch removes all
      'nest_level' variables.
      
      In order to avoid a lockdep warning, ->ndo_get_lock_subclass() was added
      to get the lockdep subclass value, which is actually the lower nesting
      depth. But drivers now use a dynamic lockdep key instead of the subclass
      to avoid the warning, so this patch removes the ->ndo_get_lock_subclass()
      callback.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3b0a18b
    • bonding: use dynamic lockdep key instead of subclass · 089bca2c
      Taehee Yoo authored
      All bonding devices share the same lockdep key, and the subclass is
      initialized with nest_level.
      But the actual nest_level value can change when a lower device is
      attached, and at that moment the subclass should be updated, which
      appears to be unsafe.
      So this patch makes bonding use a dynamic lockdep key instead of the
      subclass.
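      A small sketch of the per-device dynamic key (names assumed; the
      mechanism is lockdep_register_key() plus lockdep_set_class()):

          #include <linux/lockdep.h>
          #include <linux/spinlock.h>

          struct bond_like {
                  spinlock_t stats_lock;
                  struct lock_class_key stats_lock_key;   /* one key per device */
          };

          static void bond_like_init_locks(struct bond_like *b)
          {
                  spin_lock_init(&b->stats_lock);
                  lockdep_register_key(&b->stats_lock_key);
                  /* Each device gets its own lock class, so nested bonds no
                   * longer look like recursive locking to lockdep.
                   */
                  lockdep_set_class(&b->stats_lock, &b->stats_lock_key);
          }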
      
      Test commands:
          ip link add bond0 type bond
      
          for i in {1..5}
          do
      	    let A=$i-1
      	    ip link add bond$i type bond
      	    ip link set bond$i master bond$A
          done
          ip link set bond5 master bond0
      
      Splat looks like:
      [  307.992912] WARNING: possible recursive locking detected
      [  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
      [  307.994367] --------------------------------------------
      [  307.995092] ip/761 is trying to acquire lock:
      [  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.997045]
      	       but task is already holding lock:
      [  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.999215]
      	       other info that might help us debug this:
      [  308.000251]  Possible unsafe locking scenario:
      
      [  308.001137]        CPU0
      [  308.001533]        ----
      [  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.003302]
      		*** DEADLOCK ***
      
      [  308.004310]  May be due to missing lock nesting notation
      
      [  308.005319] 3 locks held by ip/761:
      [  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
      [  308.009422]
      	       stack backtrace:
      [  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
      [  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  308.012179] Call Trace:
      [  308.012601]  dump_stack+0x7c/0xbb
      [  308.013089]  __lock_acquire+0x269d/0x3de0
      [  308.013669]  ? register_lock_class+0x14d0/0x14d0
      [  308.014318]  lock_acquire+0x164/0x3b0
      [  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.015520]  _raw_spin_lock_nested+0x2e/0x60
      [  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
      [  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
      [  308.019710]  ? rcu_read_lock_held+0x90/0xa0
      [  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
      [  308.021953]  dev_get_stats+0x1ec/0x270
      [  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]
      
      Fixes: d3fff6c4 ("net: add netdev_lockdep_set_classes() helper")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      089bca2c
  24. 05 Jul 2019, 1 commit
    • bonding: add an option to specify a delay between peer notifications · 07a4ddec
      Vincent Bernat authored
      Currently, gratuitous ARP/ND packets are sent every `miimon'
      milliseconds. This commit allows a user to specify a custom delay
      through a new option, `peer_notif_delay'.
      
      Like for `updelay' and `downdelay', this delay should be a multiple of
      `miimon' to avoid managing an additional work queue. The configuration
      logic is copied from `updelay' and `downdelay'. However, the default
      value cannot be set using a module parameter: Netlink or sysfs should
      be used to configure this feature.
      
      When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
      observe the 500 ms delay is respected:
      
          20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
      
      In bond_mii_monitor(), I have tried to keep the lock logic readable.
      The change is due to the fact we cannot rely on a notification to
      lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
      only triggered once every N times, while we need to decrement the
      counter each time.
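      As a rough sketch of that counter behaviour (all names and the exact
      arithmetic are assumptions, not the upstream bond_mii_monitor() logic):

          struct peer_notif_state {
                  unsigned int send_peer_notif;   /* remaining miimon ticks to run */
                  unsigned int ticks_per_notif;   /* peer_notif_delay / miimon, >= 1 */
          };

          static bool peer_notif_due(struct peer_notif_state *st)
          {
                  if (!st->send_peer_notif)
                          return false;

                  /* Decrement on every miimon tick ... */
                  st->send_peer_notif--;
                  /* ... but emit NETDEV_NOTIFY_PEERS only on every N-th tick. */
                  return (st->send_peer_notif % st->ticks_per_notif) == 0;
          }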
      
      iproute2 also needs to be updated to be able to specify this new
      attribute through `ip link'.
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07a4ddec
  25. 10 Jun 2019, 1 commit
  26. 27 Sep 2018, 1 commit
    • bonding: avoid possible dead-lock · d4859d74
      Mahesh Bandewar authored
      Syzkaller reported this on a slightly older kernel but it's still
      applicable to the current kernel -
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.18.0-next-20180823+ #46 Not tainted
      ------------------------------------------------------
      syz-executor4/26841 is trying to acquire lock:
      00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652
      
      but task is already holding lock:
      00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline]
      00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (rtnl_mutex){+.+.}:
             __mutex_lock_common kernel/locking/mutex.c:925 [inline]
             __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
             mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
             rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
             bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline]
             bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320
             process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153
             worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
             kthread+0x35a/0x420 kernel/kthread.c:246
             ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415
      
      -> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}:
             process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129
             worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
             kthread+0x35a/0x420 kernel/kthread.c:246
             ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415
      
      -> #0 ((wq_completion)bond_dev->name){+.+.}:
             lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
             flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
             drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
             destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
             __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
             bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
             register_netdevice+0x337/0x1100 net/core/dev.c:8410
             bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
             rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
             rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
             netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
             rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
             netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
             netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
             netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
             sock_sendmsg_nosec net/socket.c:622 [inline]
             sock_sendmsg+0xd5/0x120 net/socket.c:632
             ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
             __sys_sendmsg+0x11d/0x290 net/socket.c:2153
             __do_sys_sendmsg net/socket.c:2162 [inline]
             __se_sys_sendmsg net/socket.c:2160 [inline]
             __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
             do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(rtnl_mutex);
                                     lock((work_completion)(&(&nnw->work)->work));
                                     lock(rtnl_mutex);
        lock((wq_completion)bond_dev->name);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor4/26841:
      
      stack backtrace:
      CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ #46
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222
       check_prev_add kernel/locking/lockdep.c:1862 [inline]
       check_prevs_add kernel/locking/lockdep.c:1975 [inline]
       validate_chain kernel/locking/lockdep.c:2416 [inline]
       __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412
       lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
       flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
       drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
       destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
       __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
       bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
       register_netdevice+0x337/0x1100 net/core/dev.c:8410
       bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
       rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
       rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
       netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
       sock_sendmsg_nosec net/socket.c:622 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:632
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
       __sys_sendmsg+0x11d/0x290 net/socket.c:2153
       __do_sys_sendmsg net/socket.c:2162 [inline]
       __se_sys_sendmsg net/socket.c:2160 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457089
      Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089
      RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
      RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4859d74
  27. 12 Jul 2018, 1 commit
    • net: Add lag.h, net_lag_port_dev_txable() · eeed992b
      Petr Machata authored
      LAG devices (team or bond) recognize for each one of their slave devices
      whether LAG traffic is going to be sent through that device. Bond calls
      such devices "active", team calls them "txable". When this state
      changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together
      with a netdev_notifier_changelowerstate_info structure that for LAG
      devices includes a tx_enabled flag that refers to the new state. The
      notification thus makes it possible to react to the changes in txability
      in drivers.
      
      However there's no way to query txability from the outside on demand.
      That is problematic namely for mlxsw, which when resolving ERSPAN packet
      path, may encounter a LAG device, and needs to determine which of the
      slaves it should choose.
      
      To that end, introduce a new function, net_lag_port_dev_txable(), which
      determines whether a given slave device is "active" or
      "txable" (depending on the flavor of the LAG device). That function then
      dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp.
      team_port_dev_txable().
      
      Because there currently is no good place where net_lag_port_dev_txable()
      should be added, introduce a new header file, lag.h, which should from
      now on hold any logic common to both team and bond. (But keep
      netif_is_lag_master() together with the rest of netif_is_*_master()
      functions).
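      The dispatch described above looks roughly like this (a sketch with an
      assumed name; the real helper lives in the new include/net/lag.h):

          #include <linux/if_team.h>
          #include <linux/netdevice.h>
          #include <net/bonding.h>

          static inline bool lag_port_dev_txable(const struct net_device *port_dev)
          {
                  if (netif_is_team_port(port_dev))
                          return team_port_dev_txable(port_dev);     /* team: "txable" */
                  else
                          return bond_is_active_slave_dev(port_dev); /* bond: "active" */
          }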
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eeed992b
  28. 17 May 2018, 1 commit
  29. 11 May 2018, 1 commit
    • bonding: send learning packets for vlans on slave · 21706ee8
      Debabrata Banerjee authored
      There was a regression at some point from the intended functionality of
      commit f60c3704 ("bonding: Fix alb mode to only use first level
      vlans.")
      
      Given the return value of vlan_get_encap_level(), we need to store the nest
      level of the bond device, and then compare the vlan's encap level to
      this. Without this, this check always fails and learning packets are
      never sent.
      
      In addition, this same commit caused a regression in the behavior of
      balance_alb, which requires learning packets be sent for all interfaces
      using the slave's mac in order to load balance properly. For vlans
      that have not set a user mac, we can send after checking one bit.
      Otherwise we need to send the set mac, albeit defeating rx load balancing
      for that vlan.
      Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21706ee8
  30. 25 Oct 2017, 1 commit
  31. 05 Oct 2017, 1 commit
  32. 12 Aug 2017, 1 commit
  33. 22 Apr 2017, 1 commit
    • bonding: fix wq initialization for links created via netlink · ea8ffc08
      Mahesh Bandewar authored
      Earlier patch 4493b81b ("bonding: initialize work-queues during
      creation of bond") moved the work-queue initialization from bond_open()
      to bond_create(). However, this caused links created via the netlink
      bond-create path (ip link add bondX type bond) to be set up without
      initializing the work-queues. Prior to the above-mentioned change,
      ndo_open was in both paths and things worked correctly. The
      consequence is visible in the report shared by Joe Stringer -
      
      I've noticed that this patch breaks bonding within namespaces if
      you're not careful to perform device cleanup correctly.
      
      Here's my repro script, you can run on any net-next with this patch
      and you'll start seeing some weird behaviour:
      
      ip netns add foo
      ip li add veth0 type veth peer name veth0+ netns foo
      ip li add veth1 type veth peer name veth1+ netns foo
      ip netns exec foo ip li add bond0 type bond
      ip netns exec foo ip li set dev veth0+ master bond0
      ip netns exec foo ip li set dev veth1+ master bond0
      ip netns exec foo ip addr add dev bond0 192.168.0.1/24
      ip netns exec foo ip li set dev bond0 up
      ip li del dev veth0
      ip li del dev veth1
      
      The second to last command segfaults, last command hangs. rtnl is now
      permanently locked. It's not a problem if you take bond0 down before
      deleting veths, or delete bond0 before deleting veths. If you delete
      either end of the veth pair as per above, either inside or outside the
      namespace, it hits this problem.
      
      Here's some kernel logs:
      [ 1221.801610] bond0: Enslaving veth0+ as an active interface with an up link
      [ 1224.449581] bond0: Enslaving veth1+ as an active interface with an up link
      [ 1281.193863] bond0: Releasing backup interface veth0+
      [ 1281.193866] bond0: the permanent HWaddr of veth0+ -
      16:bf:fb:e0:b8:43 - is still in use by bond0 - set the HWaddr of
      veth0+ to a different address to avoid conflicts
      [ 1281.193867] ------------[ cut here ]------------
      [ 1281.193873] WARNING: CPU: 0 PID: 2024 at kernel/workqueue.c:1511
      __queue_delayed_work+0x13f/0x150
      [ 1281.193873] Modules linked in: bonding veth openvswitch nf_nat_ipv6
      nf_nat_ipv4 nf_nat autofs4 nfsd auth_rpcgss nfs_acl binfmt_misc nfs
      lockd grace sunrpc fscache ppdev vmw_balloon coretemp psmouse
      serio_raw vmwgfx ttm drm_kms_helper vmw_vmci netconsole parport_pc
      configfs drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect sysimgblt
      shpchp mac_hid nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
      nf_defrag_ipv4 nf_conntrack libcrc32c lp parport hid_generic usbhid
      hid mptspi mptscsih e1000 mptbase ahci libahci
      [ 1281.193905] CPU: 0 PID: 2024 Comm: ip Tainted: G        W
      4.10.0-bisect-bond-v0.14 #37
      [ 1281.193906] Hardware name: VMware, Inc. VMware Virtual
      Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
      [ 1281.193906] Call Trace:
      [ 1281.193912]  dump_stack+0x63/0x89
      [ 1281.193915]  __warn+0xd1/0xf0
      [ 1281.193917]  warn_slowpath_null+0x1d/0x20
      [ 1281.193918]  __queue_delayed_work+0x13f/0x150
      [ 1281.193920]  queue_delayed_work_on+0x27/0x40
      [ 1281.193929]  bond_change_active_slave+0x25b/0x670 [bonding]
      [ 1281.193932]  ? synchronize_rcu_expedited+0x27/0x30
      [ 1281.193935]  __bond_release_one+0x489/0x510 [bonding]
      [ 1281.193939]  ? addrconf_notify+0x1b7/0xab0
      [ 1281.193942]  bond_netdev_event+0x2c5/0x2e0 [bonding]
      [ 1281.193944]  ? netconsole_netdev_event+0x124/0x190 [netconsole]
      [ 1281.193947]  notifier_call_chain+0x49/0x70
      [ 1281.193948]  raw_notifier_call_chain+0x16/0x20
      [ 1281.193950]  call_netdevice_notifiers_info+0x35/0x60
      [ 1281.193951]  rollback_registered_many+0x23b/0x3e0
      [ 1281.193953]  unregister_netdevice_many+0x24/0xd0
      [ 1281.193955]  rtnl_delete_link+0x3c/0x50
      [ 1281.193956]  rtnl_dellink+0x8d/0x1b0
      [ 1281.193960]  rtnetlink_rcv_msg+0x95/0x220
      [ 1281.193962]  ? __kmalloc_node_track_caller+0x35/0x280
      [ 1281.193964]  ? __netlink_lookup+0xf1/0x110
      [ 1281.193966]  ? rtnl_newlink+0x830/0x830
      [ 1281.193967]  netlink_rcv_skb+0xa7/0xc0
      [ 1281.193969]  rtnetlink_rcv+0x28/0x30
      [ 1281.193970]  netlink_unicast+0x15b/0x210
      [ 1281.193971]  netlink_sendmsg+0x319/0x390
      [ 1281.193974]  sock_sendmsg+0x38/0x50
      [ 1281.193975]  ___sys_sendmsg+0x25c/0x270
      [ 1281.193978]  ? mem_cgroup_commit_charge+0x76/0xf0
      [ 1281.193981]  ? page_add_new_anon_rmap+0x89/0xc0
      [ 1281.193984]  ? lru_cache_add_active_or_unevictable+0x35/0xb0
      [ 1281.193985]  ? __handle_mm_fault+0x4e9/0x1170
      [ 1281.193987]  __sys_sendmsg+0x45/0x80
      [ 1281.193989]  SyS_sendmsg+0x12/0x20
      [ 1281.193991]  do_syscall_64+0x6e/0x180
      [ 1281.193993]  entry_SYSCALL64_slow_path+0x25/0x25
      [ 1281.193995] RIP: 0033:0x7f6ec122f5a0
      [ 1281.193995] RSP: 002b:00007ffe69e89c48 EFLAGS: 00000246 ORIG_RAX:
      000000000000002e
      [ 1281.193997] RAX: ffffffffffffffda RBX: 00007ffe69e8dd60 RCX: 00007f6ec122f5a0
      [ 1281.193997] RDX: 0000000000000000 RSI: 00007ffe69e89c90 RDI: 0000000000000003
      [ 1281.193998] RBP: 00007ffe69e89c90 R08: 0000000000000000 R09: 0000000000000003
      [ 1281.193999] R10: 00007ffe69e89a10 R11: 0000000000000246 R12: 0000000058f14b9f
      [ 1281.193999] R13: 0000000000000000 R14: 00000000006473a0 R15: 00007ffe69e8e450
      [ 1281.194001] ---[ end trace 713a77486cbfbfa3 ]---
      
      Fixes: 4493b81b ("bonding: initialize work-queues during creation of bond")
      Reported-by: Joe Stringer <joe@ovn.org>
      Tested-by: Joe Stringer <joe@ovn.org>
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea8ffc08
  34. 06 Apr 2017, 1 commit
    • bonding: attempt to better support longer hw addresses · faeeb317
      Jarod Wilson authored
      People are using bonding over Infiniband IPoIB connections, and who knows
      what else. Infiniband has a hardware address length of 20 octets
      (INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
      Various places in the bonding code are currently hard-wired to 6 octets
      (ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
      only alb is currently possible on Infiniband links anyway, due
      to commit 1533e773, so the alb code is where most of the changes are.
      
      One major component of this change is the addition of a bond_hw_addr_copy
      function that takes a length argument, instead of using ether_addr_copy
      everywhere that hardware addresses need to be copied about. The other
      major component of this change is converting the bonding code from using
      struct sockaddr for address storage to struct sockaddr_storage, as the
      former has an address storage space of only 14, while the latter is 128
      minus a few, which is necessary to support bonding over devices with up to
      MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
      up some memory corruption issues with the current code, where it's
      possible to write an infiniband hardware address into a sockaddr declared
      on the stack.
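      A sketch of the length-aware copy helper and its use (bond_hw_addr_copy
      is the name introduced by the patch; the usage function is illustrative):

          #include <linux/netdevice.h>    /* MAX_ADDR_LEN */
          #include <linux/string.h>
          #include <linux/types.h>

          static inline void bond_hw_addr_copy(u8 *dst, const u8 *src, unsigned int len)
          {
                  memcpy(dst, src, len);
          }

          static void save_perm_hwaddr(u8 perm_addr[MAX_ADDR_LEN],
                                       const struct net_device *dev)
          {
                  /* addr_len is 6 for Ethernet but 20 (INFINIBAND_ALEN) for IPoIB,
                   * so the destination buffer must be MAX_ADDR_LEN bytes.
                   */
                  bond_hw_addr_copy(perm_addr, dev->dev_addr, dev->addr_len);
          }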
      
      Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
      hardware address now:
      
      $ cat /proc/net/bonding/bond0
      Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
      
      Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
      Primary Slave: mlx4_ib0 (primary_reselect always)
      Currently Active Slave: mlx4_ib0
      MII Status: up
      MII Polling Interval (ms): 100
      Up Delay (ms): 100
      Down Delay (ms): 100
      
      Slave Interface: mlx4_ib0
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr:
      80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
      Slave queue ID: 0
      
      Slave Interface: mlx4_ib1
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr:
      80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
      Slave queue ID: 0
      
      Also tested with a standard 1Gbps NIC bonding setup (with a mix of
      e1000 and e1000e cards), running LNST's bonding tests.
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      faeeb317