1. 05 Oct 2012, 1 commit
    • bonding: set qdisc_tx_busylock to avoid LOCKDEP splat · 49ee4920
      Committed by Eric Dumazet
      If a qdisc is installed on a bonding device, it's possible to get the
      following lockdep splat under stress:
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.6.0+ #211 Not tainted
       ---------------------------------------------
       ping/4876 is trying to acquire lock:
        (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
      
       but task is already holding lock:
        (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
         lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       6 locks held by ping/4876:
        #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e5030>] raw_sendmsg+0x600/0xc30
        #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff815ba4bd>] ip_finish_output+0x12d/0x870
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
        #3:  (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
        #4:  (&bond->lock){++.?..}, at: [<ffffffffa02128c1>] bond_start_xmit+0x31/0x4b0 [bonding]
        #5:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
      
       stack backtrace:
       Pid: 4876, comm: ping Not tainted 3.6.0+ #211
       Call Trace:
        [<ffffffff810a0145>] __lock_acquire+0x715/0x1b80
        [<ffffffff810a256b>] ? mark_held_locks+0x9b/0x100
        [<ffffffff810a1bf2>] lock_acquire+0x92/0x1d0
        [<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
        [<ffffffff81726b7c>] _raw_spin_lock+0x3c/0x50
        [<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
        [<ffffffff8106264d>] ? rcu_read_lock_bh_held+0x5d/0x90
        [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
        [<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
        [<ffffffffa0212a6a>] bond_start_xmit+0x1da/0x4b0 [bonding]
        [<ffffffff815796d0>] dev_hard_start_xmit+0x240/0x6b0
        [<ffffffff81597c6e>] sch_direct_xmit+0xfe/0x2a0
        [<ffffffff8157a249>] dev_queue_xmit+0x199/0x830
        [<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
        [<ffffffff815ba96f>] ip_finish_output+0x5df/0x870
        [<ffffffff815ba4bd>] ? ip_finish_output+0x12d/0x870
        [<ffffffff815bb964>] ip_output+0x54/0xf0
        [<ffffffff815bad48>] ip_local_out+0x28/0x90
        [<ffffffff815bc444>] ip_send_skb+0x14/0x50
        [<ffffffff815bc4b2>] ip_push_pending_frames+0x32/0x40
        [<ffffffff815e536a>] raw_sendmsg+0x93a/0xc30
        [<ffffffff8128d570>] ? selinux_file_send_sigiotask+0x1f0/0x1f0
        [<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
        [<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
        [<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
        [<ffffffff815f6855>] inet_sendmsg+0x125/0x240
        [<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
        [<ffffffff8155cddb>] sock_sendmsg+0xab/0xe0
        [<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
        [<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
        [<ffffffff8155d18c>] __sys_sendmsg+0x37c/0x390
        [<ffffffff81195b2a>] ? fsnotify+0x2ca/0x7e0
        [<ffffffff811958e8>] ? fsnotify+0x88/0x7e0
        [<ffffffff81361f36>] ? put_ldisc+0x56/0xd0
        [<ffffffff8116f98a>] ? fget_light+0x3da/0x510
        [<ffffffff8155f6c4>] sys_sendmsg+0x44/0x80
        [<ffffffff8172fc22>] system_call_fastpath+0x16/0x1b
      
      Avoid this problem using a distinct lock_class_key for bonding
      devices.
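      
      A minimal sketch of the idea (function name is illustrative, not
      necessarily the exact hunk that was merged): give bonding devices their
      own lock_class_key and point the device's qdisc_tx_busylock at it, so
      lockdep no longer treats the nested bond -> slave transmit as recursion
      on a single lock class.
      
      /* sketch, in the bonding driver's context */
      static struct lock_class_key bonding_tx_busylock_key;
      
      static void bond_setup_sketch(struct net_device *bond_dev)
      {
              /* ... the usual bond device setup ... */
      
              /* distinct lockdep class for the qdisc busylock of bond devices */
              bond_dev->qdisc_tx_busylock = &bonding_tx_busylock_key;
      }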
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49ee4920
  2. 01 Sep 2012, 1 commit
    • bonding: add some slack to arp monitoring time limits · da210f55
      Committed by Jiri Bohac
      Currently, all the time limits in the bonding ARP monitor are in
      multiples of arp_interval -- the time interval at which the ARP
      monitor is periodically scheduled.
      
      With a fast network round-trip and a little scheduling latency
      of the ARP monitor work, a limit of n*delta_in_ticks may
      effectively mean (n-1)*delta_in_ticks.
      
      This is fatal in case of n==1  (the link will stay down
      forever) and makes the behaviour non-deterministic in all the
      other cases.
      
      Add a delta_in_ticks/2 time slack to all the time limits.
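      
      A rough sketch of the pattern (helper name and exact bounds are
      illustrative, not the upstream diff): every "did this happen within the
      last n intervals" check gains an extra delta_in_ticks/2 of slack.
      
      #include <linux/jiffies.h>
      
      /* true if last_act happened within roughly n monitor intervals,
       * allowing half an interval of slack for scheduling latency
       */
      static bool bond_time_in_interval_sketch(unsigned long last_act,
                                               int delta_in_ticks, int n)
      {
              return time_in_range(jiffies,
                                   last_act - delta_in_ticks,
                                   last_act + n * delta_in_ticks
                                            + delta_in_ticks / 2);
      }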
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da210f55
  3. 23 Aug 2012, 1 commit
    • bonding: support for IPv6 transmit hashing · 6b923cb7
      Committed by John Eaglesham
      Currently the "bonding" driver does not support load balancing outgoing
      traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
      are currently supported; this patch adds transmit hashing for IPv6 (and
      TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
      bonding driver. In addition, bounds checking has been added to all
      transmit hashing functions.
      
      The algorithm chosen (xor'ing the bottom three quads of the source and
      destination addresses together, then xor'ing each byte of that result into
      the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
      was selected after testing almost 400,000 unique IPv6 addresses harvested
      from server logs. This algorithm had the most even distribution for both
      big- and little-endian architectures while still using few instructions. Its
      behavior also attempts to closely match that of the IPv4 algorithm.
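      
      A sketch of the hash as described above (function and variable names are
      illustrative, not the driver's): xor the low three 32-bit words of the
      source and destination addresses, fold the result toward the bottom byte,
      xor in the last byte of each MAC address, then reduce modulo the slave count.
      
      #include <linux/ipv6.h>
      #include <linux/if_ether.h>
      
      static int bond_ipv6_hash_sketch(const struct ipv6hdr *ip6,
                                       const struct ethhdr *eth, int count)
      {
              u32 h = (__force u32)(ip6->saddr.s6_addr32[1] ^ ip6->daddr.s6_addr32[1] ^
                                    ip6->saddr.s6_addr32[2] ^ ip6->daddr.s6_addr32[2] ^
                                    ip6->saddr.s6_addr32[3] ^ ip6->daddr.s6_addr32[3]);
      
              /* fold the upper bytes down toward the bottom byte */
              h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);
      
              /* mix in the last byte of the source and destination MACs */
              return (h ^ eth->h_source[ETH_ALEN - 1] ^ eth->h_dest[ETH_ALEN - 1]) % count;
      }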
      
      The IPv6 flow label was intentionally not included in the hash as it appears
      to be unset in the vast majority of IPv6 traffic sampled, and the current
      algorithm not using the flow label already offers a very even distribution.
      
      Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
      ie, they are not balanced based on layer 4 information. Additionally,
      IPv6 packets with intermediate headers are not balanced based on layer
      4 information. In practice these intermediate headers are not common and
      this should not cause any problems, and the alternative (a packet-parsing
      loop and look-up table) seemed slow and complicated for little gain.
      Tested-by: John Eaglesham <linux@8192.net>
      Signed-off-by: John Eaglesham <linux@8192.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b923cb7
  4. 15 Aug 2012, 4 commits
  5. 21 Jul 2012, 2 commits
  6. 19 Jul 2012, 1 commit
  7. 18 Jul 2012, 1 commit
  8. 10 Jul 2012, 1 commit
  9. 14 Jun 2012, 1 commit
  10. 13 Jun 2012, 2 commits
    • bonding: remove packet cloning in recv_probe() · de063b70
      Committed by Eric Dumazet
      Cloning all packets in the input path has a significant cost.
      
      Use skb_header_pointer()/skb_copy_bits() instead of pskb_may_pull() so
      that the recv_probe handlers (bond_3ad_lacpdu_recv / bond_arp_rcv /
      rlb_arp_recv) don't touch the input skb.
      
      bond_handle_frame() can then avoid the skb_clone()/dev_kfree_skb() pair.
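      
      The resulting pattern, sketched for the ARP case (simplified, not the
      verbatim hunk): copy the header into a stack buffer instead of pulling it
      into the skb's linear area.
      
      #include <linux/skbuff.h>
      #include <linux/if_arp.h>
      
      static int bond_arp_rcv_sketch(const struct sk_buff *skb)
      {
              struct arphdr _arp;
              const struct arphdr *arp;
      
              /* returns a pointer into the skb if the bytes are linear,
               * otherwise copies them into _arp; the skb is never modified
               */
              arp = skb_header_pointer(skb, 0, sizeof(_arp), &_arp);
              if (!arp)
                      return 0;
      
              /* ... inspect arp->ar_op, the addresses, etc. ... */
              return 1;
      }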
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
      Cc: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de063b70
    • bonding: Fix corrupted queue_mapping · 5ee31c68
      Committed by Eric Dumazet
      In the transmit path of the bonding driver, skb->cb is used to
      stash the skb->queue_mapping so that the bonding device can set its
      own queue mapping.  This value becomes corrupted since the skb->cb is
      also used in __dev_xmit_skb.
      
      When transmitting through bonding driver, bond_select_queue is
      called from dev_queue_xmit.  In bond_select_queue the original
      skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
      and skb->queue_mapping is overwritten with the bond driver queue.
      
      Subsequently in dev_queue_xmit, __dev_xmit_skb is called, which writes
      the packet length into skb->cb, thereby overwriting the stashed
      queue mapping.  In bond_dev_queue_xmit (called from hard_start_xmit),
      the queue mapping for the skb is set to the stashed value, which is now
      the skb length and hence an invalid queue for the slave device.
      
      If we want to save skb->queue_mapping into skb->cb[], the best place is
      to add a field in struct qdisc_skb_cb, to make sure it won't conflict
      with other layers (e.g. Qdisc, Infiniband...).
      
      This patch also makes sure (struct qdisc_skb_cb)->data is aligned on 8
      bytes:
      
      The netem qdisc, for example, assumes it can store a u64 in it without
      a misalignment penalty.
      
      Note: we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
      The largest user is CHOKe and it fills it.
      
      Based on a previous patch from Tom Herbert.
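      
      The shape of the resulting structure and usage, sketched from the
      description above (the field name is an assumption, not necessarily the
      one that was merged):
      
      struct qdisc_skb_cb {
              unsigned int    pkt_len;
              u16             slave_dev_queue_mapping; /* stashed by bonding */
              u16             _pad;                    /* keeps data[] 8-byte aligned */
              unsigned char   data[20];
      };
      
      /* bond_select_queue(): stash the original mapping where __dev_xmit_skb()'s
       * pkt_len write cannot clobber it, roughly:
       *
       *      qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb_get_queue_mapping(skb);
       *
       * bond_dev_queue_xmit(): restore it before handing the skb to the slave:
       *
       *      skb_set_queue_mapping(skb, qdisc_skb_cb(skb)->slave_dev_queue_mapping);
       */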
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Tom Herbert <therbert@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Roland Dreier <roland@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ee31c68
  11. 11 May 2012, 2 commits
    • drivers/net: Convert compare_ether_addr to ether_addr_equal · 2e42e474
      Committed by Joe Perches
      Use the new bool function ether_addr_equal to add
      some clarity and reduce the likelihood for misuse
      of compare_ether_addr for sorting.
      
      Done via cocci script:
      
      $ cat compare_ether_addr.cocci
      @@
      expression a,b;
      @@
      -	!compare_ether_addr(a, b)
      +	ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	compare_ether_addr(a, b)
      +	!ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	!ether_addr_equal(a, b) == 0
      +	ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	!ether_addr_equal(a, b) != 0
      +	!ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	ether_addr_equal(a, b) == 0
      +	!ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	ether_addr_equal(a, b) != 0
      +	ether_addr_equal(a, b)
      
      @@
      expression a,b;
      @@
      -	!!ether_addr_equal(a, b)
      +	ether_addr_equal(a, b)
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e42e474
    • bonding: don't increase rx_dropped after processing LACPDUs · 13a8e0c8
      Committed by Jiri Bohac
      Since commit 3aba891d, bonding processes LACP frames (802.3ad
      mode) with bond_handle_frame(). Currently a copy of the skb is
      made and the original is left to be processed by other
      rx_handlers and the rest of the network stack by returning
      RX_HANDLER_ANOTHER.  As there is no protocol handler for
      PKT_TYPE_LACPDU, the frame is dropped and dev->rx_dropped is
      increased.
      
      Fix this by making bond_handle_frame() return RX_HANDLER_CONSUMED
      if bonding has processed the LACP frame.
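      
      Sketch of the resulting control flow in the rx_handler (heavily
      simplified; the real function dispatches through the bond's recv_probe):
      
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>
      
      static rx_handler_result_t bond_handle_frame_sketch(struct sk_buff **pskb)
      {
              struct sk_buff *skb = *pskb;
              bool consumed;
      
              /* recv_probe (e.g. bond_3ad_lacpdu_recv) reports whether it fully
               * handled the frame; the constant below is a placeholder for that
               */
              consumed = true;
      
              if (consumed) {
                      consume_skb(skb);               /* nothing left for the stack */
                      return RX_HANDLER_CONSUMED;     /* stop; not counted as dropped */
              }
      
              return RX_HANDLER_ANOTHER;              /* let the stack keep going */
      }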
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13a8e0c8
  12. 27 Apr 2012, 1 commit
  13. 20 Apr 2012, 1 commit
  14. 14 Apr 2012, 2 commits
  15. 06 Apr 2012, 1 commit
  16. 05 Apr 2012, 2 commits
  17. 30 Mar 2012, 1 commit
    • bonding: emit event when bonding changes MAC · 7d26bb10
      Committed by Weiping Pan
      When a bonding device is configured with fail_over_mac=active,
      we expect to see the MAC address of the new active slave as the source MAC
      address after failover. But instead we see that the source MAC address is the
      MAC address of the previous active slave.
      
      Emit NETDEV_CHANGEADDR event when bonding changes its MAC address, in order
      to let arp_netdev_event flush neighbour cache and route cache.
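      
      The core of the change, sketched (wrapper name is illustrative): after
      bonding copies the new active slave's MAC into the bond device, raise the
      notifier so the ARP/neighbour layer flushes stale entries.
      
      #include <linux/netdevice.h>
      
      static void bond_notify_mac_change_sketch(struct net_device *bond_dev)
      {
              /* callers hold RTNL here; arp_netdev_event() reacts to this */
              call_netdevice_notifiers(NETDEV_CHANGEADDR, bond_dev);
      }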
      
      How to reproduce this bug?
      
                             -----------hostB----------------
      hostA ----- switch ---|-- eth0--bond0(192.168.100.2/24)|
      (192.168.100.1/24  \--|-- eth1-/                       |
                             --------------------------------
      
      1 on hostB,
      modprobe bonding mode=1 miimon=500 fail_over_mac=active downdelay=1000
      num_grat_arp=1
      ifconfig bond0 192.168.100.2/24 up
      ifenslave bond0 eth0
      ifenslave bond0 eth1
      
      then eth0 is the active slave, and MAC of bond0 is MAC of eth0.
      
      2 on hostA, ping 192.168.100.2
      
      3 on hostB,
      tcpdump -i bond0 -p icmp -XXX
      you will see bond0 uses MAC of eth0 as source MAC in icmp reply.
      
      4 on hostB,
      ifconfig eth0 down
      tcpdump -i bond0 -p icmp -XXX (just keep it running in step 3)
      you will see that bond0 first uses the MAC of eth1 as the source MAC in the
      icmp reply, and then it goes back to using the MAC of eth0 as the source MAC.
      Signed-off-by: Weiping Pan <wpan@redhat.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d26bb10
  18. 29 Mar 2012, 1 commit
  19. 23 Mar 2012, 1 commit
    • bonding: remove entries for master_ip and vlan_ip and query devices instead · eaddcd76
      Committed by Andy Gospodarek
      The following patch aimed to resolve an issue where secondary, tertiary,
      etc. addresses added to bond interfaces could overwrite the
      bond->master_ip and vlan_ip values.
      
              commit 917fbdb3
              Author: Henrik Saavedra Persson <henrik.e.persson@ericsson.com>
              Date:   Wed Nov 23 23:37:15 2011 +0000
      
                  bonding: only use primary address for ARP
      
      That patch was good because it prevented bonds using ARP monitoring from
      sending frames with an invalid source IP address.  Unfortunately, it
      didn't always work as expected.
      
      When using an ioctl (like ifconfig does) to set the IP address and
      netmask, 2 separate ioctls are actually called to set the IP and netmask
      if the mask chosen doesn't match the standard mask for that class of
      address.  The first ioctl did not have a mask that matched the one in
      the primary address and would still cause the device address to be
      overwritten.  The second ioctl, called to set the mask, would then be
      detected as secondary and ignored, but the damage was already done.
      
      This was not an issue when using an application that used netlink
      sockets as the setting of IP and netmask came down at once.  The
      inconsistent behavior between those two interfaces was something that
      needed to be resolved.
      
      While I was thinking about how I wanted to resolve this, Ralf Zeidler
      came up with a patch that resolved this on a RHEL kernel by keeping a full
      shadow of the entries in dev->ifa_list for the bonding device and vlan
      devices in the bonding driver.  I didn't like the duplication of the
      list as I want to see the 'bonding' struct and code shrink rather than
      grow, but liked the general idea.
      
      As the Subject indicates this patch drops the master_ip and vlan_ip
      elements from the 'bonding' and 'vlan_entry' structs, respectively.
      This can be done because a device's address-list is now traversed to
      determine the optimal source IP address for ARP requests and for checks
      to see if the bonding device has a particular IP address.  This code
      could have all be contained inside the bonding driver, but it made more
      sense to me to EXPORT and call inet_confirm_addr since it did exactly
      what was needed.
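      
      A sketch of the lookup this enables (helper name is illustrative, and the
      4-argument inet_confirm_addr() signature assumed here is the one from
      kernels of that era): pick the best local source address on the device
      for talking to a given destination.
      
      #include <linux/inetdevice.h>
      #include <net/route.h>
      
      static __be32 bond_confirm_addr_sketch(struct net_device *dev,
                                             __be32 dst, __be32 local)
      {
              struct in_device *in_dev;
              __be32 addr = 0;
      
              rcu_read_lock();
              in_dev = __in_dev_get_rcu(dev);
              if (in_dev)
                      addr = inet_confirm_addr(in_dev, dst, local, RT_SCOPE_HOST);
              rcu_read_unlock();
      
              return addr;
      }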
      
      I tested this and a backported patch and everything works as expected.
      Ralf also helped with verification of the backported patch.
      
      Thanks to Ralf for all his help on this.
      
      v2: Whitespace and organizational changes based on suggestions from Jay
      Vosburgh and Dave Miller.
      
      v3: Fixup incorrect usage of rcu_read_unlock based on Dave Miller's
      suggestion.
      Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
      CC: Ralf Zeidler <ralf.zeidler@nsn.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eaddcd76
  20. 20 Mar 2012, 1 commit
    • bonding: send igmp report for its master · 1c3ac428
      Committed by Peter Pan (潘卫平)
      Liang Zheng (lzheng@redhat.com) found that in the following topology,
      bonding does not send an igmp report when we trigger a fail-over of the bond.
      
      eth0--
            |-- bond0 -- br0
      eth1--
      
      modprobe bonding mode=1 miimon=100 resend_igmp=10
      ifconfig bond0 up
      ifenslave bond0 eth0 eth1
      
      brctl addbr br0
      ifconfig br0 192.168.100.2/24 up
      brctl addif br0 bond0
      
      Add 192.168.100.2(br0) into a multicast group, like 224.10.10.10,
      then trigger a fail-over in bonding.
      You can see that parameter "resend_igmp" does not work.
      
      The reason is that when we add br0 into a multicast group,
      it does not propagate multicast knowledge down to its ports.
      
      If we choose to propagate multicast knowledge down to all ports of the bridge,
      then we have to track every change that is done to the bridge, and keep a
      backup for all ports. That is hard to track, I think.
      
      Instead, I chose to modify bonding to send the igmp report for its master.
      
      Changelog:
      V2: correct comments
      V3: move this check into bond_resend_igmp_join_requests()
      V4: only send igmp reports if bond is enslaved to a bridge
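      
      A sketch of the resulting logic after V4 (simplified; it assumes the
      driver's existing per-device rejoin helper and the ->master pointer that
      struct net_device still had at the time):
      
      static void bond_resend_igmp_sketch(struct net_device *bond_dev)
      {
              /* rejoin groups configured on bond0 itself, as before */
              __bond_resend_igmp_join_requests(bond_dev);
      
              /* if bond0 is a bridge port, the memberships live on the bridge,
               * so rejoin on the master as well
               */
              if ((bond_dev->priv_flags & IFF_BRIDGE_PORT) && bond_dev->master)
                      __bond_resend_igmp_join_requests(bond_dev->master);
      }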
      Signed-off-by: Weiping Pan <panweiping3@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c3ac428
  21. 04 Jan 2012, 1 commit
  22. 09 Dec 2011, 2 commits
  23. 01 Dec 2011, 1 commit
  24. 17 Nov 2011, 2 commits
  25. 05 Nov 2011, 1 commit
  26. 02 Nov 2011, 1 commit
    • bonding:update speed/duplex for NETDEV_CHANGE · 98f41f69
      Committed by Weiping Pan
      Zheng Liang (lzheng@redhat.com) found a bug: if we configure bonding with the
      arp monitor, sometimes the bonding driver cannot get the speed and duplex from
      its slaves and will assume them to be 100Mb/sec and Full; please see
      /proc/net/bonding/bond0.
      But there is no such problem when using miimon.
      
      (Take igb for example.)
      I find that the reason is that after dev_open() in bond_enslave(),
      bond_update_speed_duplex() will call igb_get_settings(), but in that
      function it runs ethtool_cmd_speed_set(ecmd, -1); ecmd->duplex = -1;
      because igb gets an error value of status.
      So even though dev_open() has been called, the device is not really ready
      to report its settings.
      
      Maybe it is safe for us to call igb_get_settings() only after
      this message shows up, that is "igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex,
      Flow Control: RX".
      
      So I prefer to update the speed and duplex for a slave when it receives a
      NETDEV_CHANGE/NETDEV_UP event.
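      
      A sketch of the idea inside the slave netdev-event handling (simplified;
      struct slave and bond_update_speed_duplex() are the driver's own, and the
      real hook is the bonding netdevice notifier):
      
      #include <linux/netdevice.h>
      #include <linux/notifier.h>
      
      static int bond_slave_event_sketch(struct slave *slave, unsigned long event)
      {
              switch (event) {
              case NETDEV_UP:
              case NETDEV_CHANGE:
                      /* the link has (re)settled, so ethtool data should be valid;
                       * on error, speed and duplex are simply left as -1
                       */
                      bond_update_speed_duplex(slave);
                      break;
              default:
                      break;
              }
      
              return NOTIFY_DONE;
      }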
      
      Changelog
      V2:
      1 remove the "fake 100/Full" logic in bond_update_speed_duplex(),
        set speed and duplex to -1 when it gets error value of speed and duplex.
      2 delete the warning in bond_enslave() if bond_update_speed_duplex() returns
        error.
      3 make bond_info_show_slave() handle bad values of speed and duplex.
      Signed-off-by: Weiping Pan <wpan@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98f41f69
  27. 30 Oct 2011, 1 commit
    • bonding: eliminate bond_close race conditions · e6d265e8
      Committed by Jay Vosburgh
      This patch resolves two sets of race conditions.
      
      	Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
      first, as follows:
      
      The bond_close() calls cancel_delayed_work() to cancel delayed works.
      It, however, cannot cancel works that were already queued in workqueue.
      The bond_open() initializes work->data, and process_one_work() refers to
      get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
      work->data has been initialized. Thus, a panic occurs.
      
      	He included a patch that converted the cancel_delayed_work calls
      in bond_close to flush_delayed_work_sync, which eliminated the above
      problem.
      
      	His patch is incorporated, at least in principle, into this
      patch.  In this patch, we use cancel_delayed_work_sync in place of
      flush_delayed_work_sync, and also convert bond_uninit in addition to
      bond_close.
      
      	This conversion to _sync, however, opens new races between
      bond_close and three periodically executing workqueue functions:
      bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
      
      	The race occurs because bond_close and bond_uninit are always
      called with RTNL held, and these workqueue functions may acquire RTNL to
      perform failover-related activities.  If bond_close or bond_uninit is
      waiting in cancel_delayed_work_sync, deadlock occurs.
      
      	These deadlocks are resolved by having the workqueue functions
      acquire RTNL conditionally.  If the rtnl_trylock() fails, the functions
      reschedule and return immediately.  For the cases that are attempting to
      perform link failover, a delay of 1 is used; for the other cases, the
      normal interval is used (as those activities are not as time critical).
      
      	Additionally, the bond_mii_monitor function now stores the delay
      in a variable (mimicking the structure of activebackup_arp_mon).
      
      	Lastly, all of the above renders the kill_timers sentinel moot,
      and therefore it has been removed.
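      
      	A structural sketch of the resulting monitor loop (condensed; the
      real bond_mii_monitor carries more state and locking, and the internal
      inspect/commit helpers are the driver's own):
      
      /* sketch, in the bonding driver's context */
      static void bond_mii_monitor_sketch(struct work_struct *work)
      {
              struct bonding *bond = container_of(work, struct bonding,
                                                  mii_work.work);
              unsigned long delay = msecs_to_jiffies(bond->params.miimon);
      
              if (bond_miimon_inspect(bond)) {        /* a commit/failover is needed */
                      if (!rtnl_trylock()) {
                              /* someone holds RTNL, possibly bond_close() waiting on
                               * us in cancel_delayed_work_sync(): back off and retry
                               * almost immediately instead of deadlocking
                               */
                              delay = 1;
                              goto re_arm;
                      }
                      bond_miimon_commit(bond);       /* failover work, under RTNL */
                      rtnl_unlock();
              }
      
      re_arm:
              if (bond->params.miimon)
                      queue_delayed_work(bond->wq, &bond->mii_work, delay);
      }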
      Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6d265e8
  28. 26 Oct 2011, 1 commit
  29. 20 Oct 2011, 1 commit
  30. 19 Oct 2011, 1 commit