  1. 19 Feb, 2013 2 commits
  2. 12 Feb, 2013 1 commit
    • netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock · 2cde6acd
      Authored by Neil Horman
      __netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is
      already held.  The mechanism is used to asynchronously call __netpoll_cleanup
      outside of the holding of the rtnl_lock, so as to avoid deadlock.
      Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the
      rtnl_lock must be held while calling it.  Further, it cannot be held, because
      rcu callbacks may be issued in softirq contexts, which cannot sleep.
      
      Fix this by converting the rcu callback to a work queue that is guaranteed to
      get scheduled in process context, so that we can hold the rtnl properly while
      calling __netpoll_cleanup.
      
      Tested successfully by myself.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Cong Wang <amwang@redhat.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 Feb, 2013 1 commit
  4. 31 Jan, 2013 1 commit
  5. 30 Jan, 2013 1 commit
    • bonding: unset primary slave via sysfs · eb492f74
      Authored by Milos Vyletel
      When the bonding module is loaded with the primary parameter and the
      primary slave is later unset through sysfs, the change is not preserved
      across a bond device restart. The primary slave is unset only once, and
      the change is not recorded in the bond->params structure. The steps
      below reproduce the problem.
      
       grep OPTS /etc/sysconfig/network-scripts/ifcfg-bond0
      BONDING_OPTS="mode=active-backup miimon=100 primary=eth01"
       grep "Primary Slave" /proc/net/bonding/bond0
      Primary Slave: eth01 (primary_reselect always)
      
       echo "" > /sys/class/net/bond0/bonding/primary
       grep "Primary Slave" /proc/net/bonding/bond0
      Primary Slave: None
      
       sed -i -e 's/primary=eth01//' /etc/sysconfig/network-scripts/ifcfg-bond0
       grep OPTS /etc/sysconfig/network-scripts/ifcfg-bond0
      BONDING_OPTS="mode=active-backup miimon=100 "
       ifdown bond0 && ifup bond0
      
      without patch:
       grep "Primary Slave" /proc/net/bonding/bond0
      Primary Slave: eth01 (primary_reselect always)
      
      with patch:
       grep "Primary Slave" /proc/net/bonding/bond0
      Primary Slave: None
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Milos Vyletel <milos.vyletel@sde.cz>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
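The behaviour described in this commit can be modelled in a few lines of userspace C. This is only a sketch, not the driver code: `struct bond_primary_model`, `store_primary()` and `restart()` are hypothetical names, and the model assumes that a restart re-derives the active primary from the remembered parameter, as the message suggests.

```c
#include <assert.h>
#include <string.h>

#define IFNAMSIZ 16

/* Hypothetical, simplified stand-in for the bonding state. */
struct bond_primary_model {
    char params_primary[IFNAMSIZ]; /* remembered setting, reused on restart */
    char primary_slave[IFNAMSIZ];  /* current primary, "" if none */
};

/* Model of a write to /sys/class/net/<bond>/bonding/primary. */
static void store_primary(struct bond_primary_model *bond, const char *name)
{
    assert(bond != NULL && name != NULL);
    assert(strlen(name) < IFNAMSIZ);

    if (name[0] == '\0') {
        /* Unset: clear the live value AND the remembered one. */
        bond->primary_slave[0] = '\0';
        bond->params_primary[0] = '\0';
        return;
    }
    strcpy(bond->primary_slave, name);
    strcpy(bond->params_primary, name);
}

/* Model of ifdown/ifup: the primary is taken from the remembered setting. */
static void restart(struct bond_primary_model *bond)
{
    strcpy(bond->primary_slave, bond->params_primary);
}
```

If `store_primary()` cleared only `primary_slave`, a later `restart()` would bring back the old primary, which matches the "without patch" output in the message.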
  6. 07 Jan, 2013 1 commit
  7. 05 Jan, 2013 1 commit
  8. 15 Dec, 2012 1 commit
  9. 08 Dec, 2012 1 commit
  10. 01 Dec, 2012 2 commits
    • bonding: delete migrated IP addresses from the rlb hash table · e53665c6
      Authored by Jiri Bohac
      Bonding in balance-alb mode records information from ARP packets
      passing through the bond in a hash table (rx_hashtbl).
      
      In certain situations (e.g. link change of a slave),
      rlb_update_rx_clients() will send out ARP packets to update ARP
      caches of other hosts on the network to achieve RX load
      balancing.
      
      The problem is that once an IP address is recorded in the hash
      table, it stays there indefinitely. If this IP address is
      migrated to a different host in the network, bonding still sends
      out ARP packets that poison other systems' ARP caches with
      invalid information.
      
      This patch solves this by looking at all incoming ARP packets,
      and checking if the source IP address is one of the source
      addresses stored in the rx_hashtbl. If it is, but the MAC
      addresses differ, the corresponding hash table entries are
      removed. Thus, when an IP address is migrated, the first ARP
      broadcast by its new owner will purge the offending entries of
      rx_hashtbl.
      
      The hash table is hashed by ip_dst. To be able to do the above
      check efficiently (not walking the whole hash table), we need a
      reverse mapping (by ip_src).
      
      I added three new members in struct rlb_client_info:
         rx_hashtbl[x].src_first will point to the start of a list of
            entries for which hash(ip_src) == x.
         The list is linked with src_next and src_prev.
      
      When an incoming ARP packet arrives at rlb_arp_recv()
      rlb_purge_src_ip() can quickly walk only the entries on the
      corresponding lists, i.e. the entries that are likely to contain
      the offending IP address.
      
      To avoid confusion, I renamed these existing fields of struct
      rlb_client_info:
      	next -> used_next
      	prev -> used_prev
      	rx_hashtbl_head -> rx_hashtbl_used_head
      
      (The current linked list is _not_ a list of hash table
      entries with colliding ip_dst. It's a list of entries that are
      being used; its purpose is to avoid walking the whole hash table
      when looking for used entries.)
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
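The reverse mapping described in this commit can be sketched in userspace C as follows. This is an illustrative model, not the bonding code: the table size, the `hash_ip()` function, the `used` flag, and the `tbl_*` helper names are assumptions. Only the member names `src_first`, `src_next`, and `src_prev` come from the message. Unlike a real driver, the first element of each list here has `src_prev` set to a null index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TBL_SIZE   256
#define NULL_INDEX 0xffffffffu

struct client {
    uint32_t ip_src, ip_dst;
    uint8_t  mac_src[6];
    int      used;
    /* Slot x heads the list of entries for which hash(ip_src) == x. */
    uint32_t src_first, src_next, src_prev;
};

static struct client tbl[TBL_SIZE];

/* Assumed hash: XOR of the four address bytes. */
static uint32_t hash_ip(uint32_t ip)
{
    return (ip ^ (ip >> 8) ^ (ip >> 16) ^ (ip >> 24)) & 0xff;
}

static void tbl_init(void)
{
    for (uint32_t i = 0; i < TBL_SIZE; i++) {
        memset(&tbl[i], 0, sizeof(tbl[i]));
        tbl[i].src_first = tbl[i].src_next = tbl[i].src_prev = NULL_INDEX;
    }
}

/* Unlink entry i from the ip_src list it is on. */
static void src_unlink(uint32_t i)
{
    uint32_t next = tbl[i].src_next, prev = tbl[i].src_prev;

    if (prev == NULL_INDEX)
        tbl[hash_ip(tbl[i].ip_src)].src_first = next;
    else
        tbl[prev].src_next = next;
    if (next != NULL_INDEX)
        tbl[next].src_prev = prev;
    tbl[i].src_next = tbl[i].src_prev = NULL_INDEX;
}

/* Record an entry; the slot is chosen by hash(ip_dst). */
static void tbl_record(uint32_t ip_src, const uint8_t mac_src[6], uint32_t ip_dst)
{
    uint32_t i = hash_ip(ip_dst), head = hash_ip(ip_src);

    if (tbl[i].used)
        src_unlink(i);                 /* slot is being reused */
    tbl[i].ip_src = ip_src;
    tbl[i].ip_dst = ip_dst;
    memcpy(tbl[i].mac_src, mac_src, 6);
    tbl[i].used = 1;

    /* Push onto the front of the ip_src list. */
    tbl[i].src_prev = NULL_INDEX;
    tbl[i].src_next = tbl[head].src_first;
    if (tbl[head].src_first != NULL_INDEX)
        tbl[tbl[head].src_first].src_prev = i;
    tbl[head].src_first = i;
}

/* Drop entries that recorded ip_src with a different MAC; returns the count. */
static int tbl_purge_src_ip(uint32_t ip_src, const uint8_t mac_src[6])
{
    int purged = 0;
    uint32_t i = tbl[hash_ip(ip_src)].src_first;

    while (i != NULL_INDEX) {
        uint32_t next = tbl[i].src_next;   /* read before unlinking */

        if (tbl[i].ip_src == ip_src &&
            memcmp(tbl[i].mac_src, mac_src, 6) != 0) {
            src_unlink(i);
            tbl[i].used = 0;
            purged++;
        }
        i = next;
    }
    return purged;
}
```

The purge walks only one list, so its cost depends on how many entries share that `ip_src` hash rather than on the size of the whole table.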
    • bonding: rlb mode of bond should not alter ARP originating via bridge · 567b871e
      Authored by zheng.li
      Do not modify or load balance ARP packets passing through balance-alb
      mode (wherein the ARP did not originate locally, and arrived via a bridge).
      
      Modifying pass-through ARP replies causes an incorrect MAC address
      to be placed into the ARP packet, rendering peers unable to communicate
      with the actual destination from which the ARP reply originated.
      
      Load balancing pass-through ARP requests causes an entry to be
      created for the peer in the rlb table, and bond_alb_monitor will
      occasionally issue ARP updates to all peers in the table instructing them
      as to which MAC address they should communicate with; this occurs when
      some event sets rx_ntt.  In the bridged case, however, the MAC address
      used for the update would be the MAC of the slave, not the actual source
      MAC of the originating destination.  This would render peers unable to
      communicate with the destinations beyond the bridge.
      Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 30 Nov, 2012 3 commits
    • bonding: fix race condition in bonding_store_slaves_active · e196c0e5
      Authored by nikolay@redhat.com
      There is a race between bonding_store_slaves_active() and the slave
      manipulation functions. The bond_for_each_slave walk in
      bonding_store_slaves_active() is not protected by any synchronization
      mechanism, so a NULL pointer dereference is easy to reach.
      Fix this by acquiring bond->lock for the slave walk.
      
       v2: Make description text < 75 columns
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: make arp_ip_target parameter checks consistent with sysfs · 90fb6250
      Authored by nikolay@redhat.com
      The module can be loaded with arp_ip_target="255.255.255.255", but that
      target can then never be removed, because the sysfs handler rejects that
      value. Make the module parameter checks consistent with the sysfs checks.
      
       v2: Fix formatting
       v3: Make description text < 75 columns
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
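One way to keep the two code paths consistent is to route both through a single validation helper. The userspace sketch below illustrates the idea with the POSIX `inet_pton()` call. `arp_target_ok()` is a hypothetical name, and the check covers only the broadcast value mentioned in the message. Whatever else the driver checks is not modelled.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

/*
 * Returns 1 if `text` is an IPv4 address acceptable as an ARP target,
 * 0 otherwise. Intended to be shared by the load-time parameter check
 * and the sysfs store handler so that both accept the same values.
 */
static int arp_target_ok(const char *text)
{
    struct in_addr addr;

    assert(text != NULL);
    if (inet_pton(AF_INET, text, &addr) != 1)
        return 0;                               /* not a dotted quad */
    if (addr.s_addr == htonl(INADDR_BROADCAST))
        return 0;                               /* 255.255.255.255 */
    return 1;
}
```

With one helper, a value that can be set at load time can also be matched and removed later through sysfs.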
    • bonding: fix miimon and arp_interval delayed work race conditions · fbb0c41b
      Authored by nikolay@redhat.com
      First I would give three observations which will be used later.
      Observation 1: if (delayed_work_pending(wq)) cancel_delayed_work(wq)
       This usage is wrong because the pending bit is cleared just before the
       work's fn is executed and if the function re-arms itself we might end up
       with the work still running. It's safe to call cancel_delayed_work_sync()
       even if the work is not queued at all.
      Observation 2: Use of INIT_DELAYED_WORK()
       Work needs to be initialized only once prior to (de/en)queueing.
      Observation 3: IFF_UP is set only after ndo_open is called
      
      Related race conditions:
      1. Race between bonding_store_miimon() and bonding_store_arp_interval()
       Because of Obs.1 we can end up having both works enqueued.
      2. Multiple races with INIT_DELAYED_WORK()
       Since the works are not protected by anything between INIT_DELAYED_WORK()
       and calls to (en/de)queue it is possible for races between the following
       functions:
       (races are also possible between the calls to INIT_DELAYED_WORK()
        and workqueue code)
       bonding_store_miimon() - bonding_store_arp_interval(), bond_close(),
      			  bond_open(), enqueued functions
       bonding_store_arp_interval() - bonding_store_miimon(), bond_close(),
      				bond_open(), enqueued functions
      3. By Obs.1 we need to change bond_cancel_all()
      
      Bugs 1 and 2 are fixed by moving all work initializations into bond_open
      which by Obs. 2 and Obs. 3 and the fact that we make sure that all works
      are cancelled in bond_close(), is guaranteed not to have any work
      enqueued.
      Also RTNL lock is now acquired in bonding_store_miimon/arp_interval so
      they can't race with bond_close and bond_open. The opposing work is
      cancelled only if the IFF_UP flag is set and it is cancelled
      unconditionally. The opposing work is already cancelled if the interface
      is down so no need to cancel it again. This way we don't need new
      synchronizations for the bonding workqueue. These bugs (and fixes) are
      tied together and belong in the same patch.
      Note: I have left 1 line intentionally over 80 characters (84) because I
            didn't like how it looks broken down. If you'd prefer it otherwise,
            then simply break it.
      
       v2: Make description text < 75 columns
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 29 Nov, 2012 1 commit
    • bonding: in balance-rr mode, set curr_active_slave only if it is up · 4e591b93
      Authored by Michal Kubeček
      If all slaves of a balance-rr bond with ARP monitor are enslaved
      while their link is down, the bond stays down even after the
      slaves come up.
      
      This is caused by bond_enslave() setting curr_active_slave to
      first slave not taking into account its link state. As
      bond_loadbalance_arp_mon() uses curr_active_slave to identify
      whether slave's down->up transition should update bond's link
      state, bond stays down even if slaves are up (until first slave
      goes from up to down at least once).
      
      Before commit f31c7937 "bonding: start slaves with link down for
      ARP monitor", this was masked by slaves always starting in UP
      state with ARP monitor (and MII monitor not relying on
      curr_active_slave being NULL if there is no slave up).
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
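The interaction between enslaving and the monitor can be modelled roughly as below. This is a simplified userspace sketch: the `rr_*` types, `enslave()`, and `slave_link_up()` are hypothetical, and the monitor logic is reduced to one assumption taken from the message, namely that a down-to-up transition raises the bond only when `curr_active_slave` is NULL.

```c
#include <assert.h>
#include <stddef.h>

enum link_state { LINK_DOWN, LINK_UP };

struct rr_slave { enum link_state link; };

struct rr_bond {
    struct rr_slave *curr_active_slave; /* NULL while no slave is up */
    int carrier_on;
};

/* Fixed behaviour: take the new slave as current only if its link is up. */
static void enslave(struct rr_bond *bond, struct rr_slave *slave)
{
    assert(bond != NULL && slave != NULL);
    if (bond->curr_active_slave == NULL && slave->link == LINK_UP)
        bond->curr_active_slave = slave;
}

/* Assumed monitor step for a slave whose link just came up. */
static void slave_link_up(struct rr_bond *bond, struct rr_slave *slave)
{
    assert(bond != NULL && slave != NULL);
    slave->link = LINK_UP;
    if (bond->curr_active_slave == NULL) {
        bond->curr_active_slave = slave;
        bond->carrier_on = 1;
    }
}
```

If `enslave()` set `curr_active_slave` regardless of link state, the NULL test in `slave_link_up()` would never succeed and the bond would stay down, which is the failure the commit describes.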
  13. 22 Nov, 2012 1 commit
  14. 19 Nov, 2012 1 commit
  15. 01 Nov, 2012 2 commits
  16. 17 Oct, 2012 1 commit
  17. 05 Oct, 2012 1 commit
    • bonding: set qdisc_tx_busylock to avoid LOCKDEP splat · 49ee4920
      Authored by Eric Dumazet
      If a qdisc is installed on a bonding device, it's possible to get
      the following lockdep splat under stress:
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.6.0+ #211 Not tainted
       ---------------------------------------------
       ping/4876 is trying to acquire lock:
        (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
      
       but task is already holding lock:
        (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
         lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       6 locks held by ping/4876:
        #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e5030>] raw_sendmsg+0x600/0xc30
        #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff815ba4bd>] ip_finish_output+0x12d/0x870
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
        #3:  (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
        #4:  (&bond->lock){++.?..}, at: [<ffffffffa02128c1>] bond_start_xmit+0x31/0x4b0 [bonding]
        #5:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
      
       stack backtrace:
       Pid: 4876, comm: ping Not tainted 3.6.0+ #211
       Call Trace:
        [<ffffffff810a0145>] __lock_acquire+0x715/0x1b80
        [<ffffffff810a256b>] ? mark_held_locks+0x9b/0x100
        [<ffffffff810a1bf2>] lock_acquire+0x92/0x1d0
        [<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
        [<ffffffff81726b7c>] _raw_spin_lock+0x3c/0x50
        [<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
        [<ffffffff8106264d>] ? rcu_read_lock_bh_held+0x5d/0x90
        [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
        [<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
        [<ffffffffa0212a6a>] bond_start_xmit+0x1da/0x4b0 [bonding]
        [<ffffffff815796d0>] dev_hard_start_xmit+0x240/0x6b0
        [<ffffffff81597c6e>] sch_direct_xmit+0xfe/0x2a0
        [<ffffffff8157a249>] dev_queue_xmit+0x199/0x830
        [<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
        [<ffffffff815ba96f>] ip_finish_output+0x5df/0x870
        [<ffffffff815ba4bd>] ? ip_finish_output+0x12d/0x870
        [<ffffffff815bb964>] ip_output+0x54/0xf0
        [<ffffffff815bad48>] ip_local_out+0x28/0x90
        [<ffffffff815bc444>] ip_send_skb+0x14/0x50
        [<ffffffff815bc4b2>] ip_push_pending_frames+0x32/0x40
        [<ffffffff815e536a>] raw_sendmsg+0x93a/0xc30
        [<ffffffff8128d570>] ? selinux_file_send_sigiotask+0x1f0/0x1f0
        [<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
        [<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
        [<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
        [<ffffffff815f6855>] inet_sendmsg+0x125/0x240
        [<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
        [<ffffffff8155cddb>] sock_sendmsg+0xab/0xe0
        [<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
        [<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
        [<ffffffff8155d18c>] __sys_sendmsg+0x37c/0x390
        [<ffffffff81195b2a>] ? fsnotify+0x2ca/0x7e0
        [<ffffffff811958e8>] ? fsnotify+0x88/0x7e0
        [<ffffffff81361f36>] ? put_ldisc+0x56/0xd0
        [<ffffffff8116f98a>] ? fget_light+0x3da/0x510
        [<ffffffff8155f6c4>] sys_sendmsg+0x44/0x80
        [<ffffffff8172fc22>] system_call_fastpath+0x16/0x1b
      
      Avoid this problem using a distinct lock_class_key for bonding
      devices.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 01 Sep, 2012 1 commit
    • bonding: add some slack to arp monitoring time limits · da210f55
      Authored by Jiri Bohac
      Currently, all the time limits in the bonding ARP monitor are in
      multiples of arp_interval -- the time interval at which the ARP
      monitor is periodically scheduled.
      
      With a fast network round-trip and a little scheduling latency
      of the ARP monitor work, a limit of n*delta_in_ticks may
      effectively mean (n-1)*delta_in_ticks.
      
      This is fatal when n==1 (the link will stay down forever) and
      makes the behaviour non-deterministic in all the other cases.
      
      Add a delta_in_ticks/2 time slack to all the time limits.
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
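The effect of the slack is easiest to see with numbers. The helper below is a userspace sketch of the time check, not the driver code: `reply_is_fresh()` and its parameters are hypothetical, and tick counters are modelled as plain `unsigned long` values.

```c
#include <assert.h>

/*
 * Returns 1 if the last reply is recent enough, i.e. it arrived no more
 * than n * delta_in_ticks + slack ticks before `now`.
 */
static int reply_is_fresh(unsigned long now, unsigned long last_rx,
                          unsigned long n, unsigned long delta_in_ticks,
                          unsigned long slack)
{
    assert(delta_in_ticks > 0);
    return (now - last_rx) <= n * delta_in_ticks + slack;
}
```

Take delta_in_ticks = 100 and n = 1. A reply arrives at tick 2, shortly after the monitor ran at tick 0, and the next monitor run is delayed to tick 105. The elapsed time is 103 ticks, so a strict limit of 100 rejects a slave that answered in every interval, while a limit with delta_in_ticks/2 = 50 ticks of slack accepts it. A slave that has been silent for three intervals is still rejected.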
  19. 23 Aug, 2012 1 commit
    • bonding: support for IPv6 transmit hashing · 6b923cb7
      Authored by John Eaglesham
      Currently the "bonding" driver does not support load balancing outgoing
      traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
      are currently supported; this patch adds transmit hashing for IPv6 (and
      TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
      bonding driver. In addition, bounds checking has been added to all
      transmit hashing functions.
      
      The algorithm chosen (xor'ing the bottom three quads of the source and
      destination addresses together, then xor'ing each byte of that result into
      the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
      was selected after testing almost 400,000 unique IPv6 addresses harvested
      from server logs. This algorithm had the most even distribution for both
      big- and little-endian architectures while still using few instructions. Its
      behavior also attempts to closely match that of the IPv4 algorithm.
      
      The IPv6 flow label was intentionally not included in the hash as it appears
      to be unset in the vast majority of IPv6 traffic sampled, and the current
      algorithm not using the flow label already offers a very even distribution.
      
      Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
      i.e., they are not balanced based on layer 4 information. Additionally,
      IPv6 packets with intermediate headers are not balanced based on layer
      4 information. In practice these intermediate headers are not common and
      this should not cause any problems, and the alternative (a packet-parsing
      loop and look-up table) seemed slow and complicated for little gain.
      Tested-by: John Eaglesham <linux@8192.net>
      Signed-off-by: John Eaglesham <linux@8192.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
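The hashing steps described in this message can be sketched in portable C as follows. This is a reading of the prose, not a copy of the driver code. It assumes that "quads" means the four 32-bit words of an IPv6 address, it loads those words most significant byte first so the result does not depend on host byte order, and the final modulo by the slave count is an assumption about how the hash is turned into a slave index. `ipv6_xmit_hash()` and `load32()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Load a 32-bit word from four bytes, most significant byte first. */
static uint32_t load32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint32_t ipv6_xmit_hash(const uint8_t src[16], const uint8_t dst[16],
                               const uint8_t src_mac[6],
                               const uint8_t dst_mac[6],
                               uint32_t slave_count)
{
    uint32_t v;

    assert(slave_count > 0);
    /* XOR the bottom three 32-bit words of both addresses (bytes 4..15). */
    v = (load32(src + 4)  ^ load32(dst + 4)) ^
        (load32(src + 8)  ^ load32(dst + 8)) ^
        (load32(src + 12) ^ load32(dst + 12));
    /* Fold every byte of the result into the bottom byte. */
    v ^= (v >> 24) ^ (v >> 16) ^ (v >> 8);
    /* Mix in the last byte of each MAC address, then pick a slave. */
    return (v ^ src_mac[5] ^ dst_mac[5]) % slave_count;
}
```

The flow label and any layer 4 fields are deliberately left out here, in line with the reasoning given in the message.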
  20. 15 Aug, 2012 4 commits
  21. 21 Jul, 2012 3 commits
  22. 19 Jul, 2012 1 commit
  23. 18 Jul, 2012 1 commit
  24. 10 Jul, 2012 2 commits
  25. 18 Jun, 2012 1 commit
  26. 14 Jun, 2012 1 commit
  27. 13 Jun, 2012 3 commits
    • bonding: remove packet cloning in recv_probe() · de063b70
      Authored by Eric Dumazet
      Cloning all packets in the input path has a significant cost.
      
      Use skb_header_pointer()/skb_copy_bits() instead of pskb_may_pull() so
      that recv_probe handlers (bond_3ad_lacpdu_recv / bond_arp_rcv /
      rlb_arp_recv) don't touch the input skb.
      
      bond_handle_frame() can then avoid the skb_clone()/dev_kfree_skb().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
      Cc: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: Fix corrupted queue_mapping · 5ee31c68
      Authored by Eric Dumazet
      In the transmit path of the bonding driver, skb->cb is used to
      stash the skb->queue_mapping so that the bonding device can set its
      own queue mapping.  This value becomes corrupted since the skb->cb is
      also used in __dev_xmit_skb.
      
      When transmitting through bonding driver, bond_select_queue is
      called from dev_queue_xmit.  In bond_select_queue the original
      skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
      and skb->queue_mapping is overwritten with the bond driver queue.
      
      Subsequently in dev_queue_xmit, __dev_xmit_skb is called which writes
      the packet length into skb->cb, thereby overwriting the stashed
      queue mapping.  In bond_dev_queue_xmit (called from hard_start_xmit),
      the queue mapping for the skb is set to the stashed value which is now
      the skb length and hence is an invalid queue for the slave device.
      
      If we want to save skb->queue_mapping into skb->cb[], the best place is
      to add a field in struct qdisc_skb_cb, to make sure it won't conflict
      with other layers (e.g. Qdisc, Infiniband...)
      
      This patch also makes sure (struct qdisc_skb_cb)->data is aligned on 8
      bytes:
      
      netem qdisc for example assumes it can store an u64 in it, without
      misalignment penalty.
      
      Note : we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
      The largest user is CHOKe and it fills it.
      
      Based on a previous patch from Tom Herbert.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Tom Herbert <therbert@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Roland Dreier <roland@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
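The collision on the shared scratch area, and why a dedicated field avoids it, can be shown with a small userspace model. Everything here is illustrative: `struct skb_model`, `struct qdisc_cb_model`, and the helper names are hypothetical, and the layout assumes the old stash and the packet length both started at offset 0 of the scratch buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct skb_model {
    uint16_t queue_mapping;
    unsigned char cb[48];          /* scratch area shared by several layers */
};

/* Layout the qdisc layer imposes on cb[], with a slot for bonding. */
struct qdisc_cb_model {
    uint32_t pkt_len;
    uint16_t bond_queue_mapping;   /* dedicated field, cannot overlap pkt_len */
};

#define OLD_STASH_OFF 0
#define NEW_STASH_OFF offsetof(struct qdisc_cb_model, bond_queue_mapping)

/* The qdisc layer records the packet length at the start of cb[]. */
static void qdisc_set_pkt_len(struct skb_model *skb, uint32_t len)
{
    memcpy(skb->cb + offsetof(struct qdisc_cb_model, pkt_len),
           &len, sizeof(len));
}

/* Bonding saves the original queue mapping at byte offset `off`. */
static void stash_queue_mapping(struct skb_model *skb, size_t off)
{
    assert(off + sizeof(skb->queue_mapping) <= sizeof(skb->cb));
    memcpy(skb->cb + off, &skb->queue_mapping, sizeof(skb->queue_mapping));
}

/* Bonding reads the saved queue mapping back before transmitting. */
static uint16_t restore_queue_mapping(const struct skb_model *skb, size_t off)
{
    uint16_t q;

    assert(off + sizeof(q) <= sizeof(skb->cb));
    memcpy(&q, skb->cb + off, sizeof(q));
    return q;
}
```

Stashing at `OLD_STASH_OFF`, then calling `qdisc_set_pkt_len()`, returns bytes of the packet length instead of the saved queue. Stashing at `NEW_STASH_OFF` survives the same sequence, because the two values occupy separate fields of one agreed layout.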
    • bonding:record primary when modify it via sysfs · 8a93664d
      Authored by Weiping Pan
      If we modify primary via sysfs and it is not a valid slave,
      we should record it for future use. This is the same behavior as in
      bond_check_params().
      Signed-off-by: Weiping Pan <wpan@redhat.com>
      Acked-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>