1. 08 Dec, 2010 1 commit
  2. 28 Oct, 2010 3 commits
  3. 27 Oct, 2010 1 commit
    • fib: fix fib_nl_newrule() · ebb9fed2
      Eric Dumazet committed
      Some panic reports in fib_rules_lookup() show a rule could have a NULL
      pointer as a next pointer in the rules_list.
      
      This can actually happen because of a bug in fib_nl_newrule(): it
      checks whether the rule being inserted is the destination of unresolved
      gotos (other rules that have gotos to this about-to-be-inserted rule).
      
      The problem is that it resolves those gotos before the rule is
      inserted in the rules_list (and has a valid next pointer).
      
      Fix this by moving the rules_list insertion before the changes on gotos.
      
      A lockless reader can no longer follow a ctarget pointer unless the
      destination is ready (has a valid next pointer).
      Reported-by: Oleg A. Arkhangelsky <sysoleg@yandex.ru>
      Reported-by: Joe Buehler <aspam@cox.net>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebb9fed2
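The ordering described in this commit can be sketched in userspace C11. This is only an illustration under stated assumptions: `struct rule`, the `list_end` sentinel, and every function name below are hypothetical simplifications, not the kernel's `struct fib_rule` or its list helpers. The point it shows is that the new rule is linked into the list (so it has a valid next pointer) before any other rule's `ctarget` is made to point at it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified rule; not the kernel's struct fib_rule. */
struct rule {
    unsigned pref;                   /* priority, the list is sorted by it */
    unsigned target;                 /* pref of the goto destination, 0 = none */
    _Atomic(struct rule *) next;     /* next rule in the list */
    _Atomic(struct rule *) ctarget;  /* resolved goto destination, or NULL */
};

/* Assumed sentinel ending the list: a linked rule never has next == NULL. */
static struct rule list_end;

static void rule_init(struct rule *r, unsigned pref, unsigned target)
{
    r->pref = pref;
    r->target = target;
    atomic_init(&r->next, NULL);
    atomic_init(&r->ctarget, NULL);
}

/* Initialise the dummy head of an empty rule list. */
static void rules_init(struct rule *head)
{
    rule_init(head, 0, 0);
    atomic_store(&head->next, &list_end);
}

/* Step 1: link r into the sorted list, giving it a valid next pointer. */
static void list_insert(struct rule *head, struct rule *r)
{
    struct rule *prev = head, *cur;

    while ((cur = atomic_load(&prev->next)) != &list_end && cur->pref <= r->pref)
        prev = cur;
    atomic_store_explicit(&r->next, cur, memory_order_relaxed);
    /* Release: a reader that sees r through prev->next also sees r->next. */
    atomic_store_explicit(&prev->next, r, memory_order_release);
}

/* Step 2: point pending gotos at r. Returns how many were resolved. */
static int resolve_gotos(struct rule *head, struct rule *r)
{
    struct rule *cur;
    int resolved = 0;

    for (cur = atomic_load(&head->next); cur != &list_end;
         cur = atomic_load(&cur->next)) {
        if (cur->target == r->pref && atomic_load(&cur->ctarget) == NULL) {
            /* The invariant this fix restores: r is already linked. */
            assert(atomic_load(&r->next) != NULL);
            atomic_store_explicit(&cur->ctarget, r, memory_order_release);
            resolved++;
        }
    }
    return resolved;
}

/* Fixed ordering: insert first, resolve gotos second. */
static int add_rule(struct rule *head, struct rule *r)
{
    list_insert(head, r);
    return resolve_gotos(head, r);
}
```

With the two steps swapped, a lockless reader could follow `ctarget` to a rule whose `next` is still NULL, which is the failure mode the panic reports point to.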
  4. 26 Oct, 2010 5 commits
  5. 25 Oct, 2010 1 commit
  6. 21 Oct, 2010 7 commits
  7. 20 Oct, 2010 5 commits
  8. 18 Oct, 2010 2 commits
    • bonding: Fix napi poll for bonding driver · 990c3d6f
      Neil Horman committed
      Usually the netpoll path, when performing a napi poll, can get away with just
      polling all the napi instances of the configured device.  That's not the case for
      the bonding driver, however, as the napi instances which may wind up getting
      flagged as needing polling after the poll_controller call don't belong to the
      bonded device, but rather to the slave devices.  Fix this by checking the device
      in question for the IFF_MASTER flag; if it is set, we know we need to check the full
      poll list for this cpu, rather than just the device's napi instance list.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      990c3d6f
    • bonding: Fix bonding driver's improper modification of netpoll structure · c2355e1a
      Neil Horman committed
      The bonding driver currently modifies the netpoll structure in its xmit path
      while sending frames from netpoll.  This is racy, as other cpus can access the
      netpoll structure in parallel. Since the bonding driver points np->dev to a
      slave device, other cpus can inadvertently attempt to send data directly to
      slave devices, leading to improper locking with the bonding master, lost frames,
      and deadlocks.  This patch fixes that up.
      
      This patch also removes the real_dev pointer from the netpoll structure, as that
      data is really only used by bonding in the poll_controller, and we can emulate
      its behavior by checking each slave for IS_UP.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2355e1a
  9. 17 Oct, 2010 2 commits
    • fib: remove a useless synchronize_rcu() call · a0a4a85a
      Eric Dumazet committed
      fib_nl_delrule() calls synchronize_rcu() for no apparent reason,
      while rtnl is held.
      
      I suspect it was done to avoid an atomic_inc_not_zero() in
      fib_rules_lookup(), which commit 7fa7cb71 added anyway.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a0a4a85a
    • net: allocate skbs on local node · 564824b0
      Eric Dumazet committed
      Commit b30973f8 (node-aware skb allocation) spread the bad habit of
      allocating net driver skbs on a given memory node: the one closest to
      the NIC hardware. This is wrong because as soon as we try to scale the
      network stack, we need to use many cpus to handle traffic, and we hit
      slub/slab management on cross-node allocations/frees when these cpus
      have to alloc/free skbs bound to a central node.
      
      skbs allocated in the RX path are ephemeral; they have a very short
      lifetime, so the extra cost of maintaining NUMA affinity is too high. What
      appeared as a nice idea four years ago is in fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per second,
      needing all the available cpus.
      
      The cost of cross-node handling in the network and vm stacks outweighs the
      small benefit hardware had when doing its DMA transfer in its 'local'
      memory node at RX time. Even trying to differentiate the two allocations
      done for one skb (the sk_buff on the local node, the data part on the NIC
      hardware node) is not enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564824b0
  10. 13 Oct, 2010 1 commit
    • net: percpu net_device refcount · 29b4433d
      Eric Dumazet committed
      We tried very hard to remove all possible dev_hold()/dev_put() pairs in
      network stack, using RCU conversions.
      
      There is still an unavoidable device refcount change for every dst we
      create/destroy, and this can slow down some workloads (routers or some
      app servers, mmap af_packet)
      
      We can switch to a percpu refcount implementation, now that the dynamic
      per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
      256 bytes per device.
      
      On x86, dev_hold(dev) code:
      
      before:
              lock    incl 0x280(%ebx)
      after:
              movl    0x260(%ebx),%eax
              incl    fs:(%eax)
      
      Stress bench :
      
      (Sending 160.000.000 UDP frames,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_TRIE)
      
      Before:
      
      real    1m1.662s
      user    0m14.373s
      sys     12m55.960s
      
      After:
      
      real    0m51.179s
      user    0m15.329s
      sys     10m15.942s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29b4433d
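A minimal userspace sketch of the per-CPU refcount idea follows. It assumes a fixed 64-slot array and passes the CPU index explicitly; the names `netdev_sketch`, `dev_hold_sketch`, `dev_put_sketch` and `netdev_refcnt_read_sketch` are hypothetical stand-ins, and the real kernel code uses its per-CPU allocator rather than an embedded array. It shows why hold/put need no locked instruction (each CPU touches only its own slot) and why the true count is only known by summing all slots.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 64   /* assumption: the 64-cpu machine from the text */

/* Hypothetical stand-in for the refcount part of struct net_device. */
struct netdev_sketch {
    int32_t pcpu_refcnt[NR_CPUS_SKETCH];   /* 64 * 4 = 256 bytes */
};

/* Take a reference: plain increment of the calling CPU's own slot. */
static void dev_hold_sketch(struct netdev_sketch *dev, unsigned cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    dev->pcpu_refcnt[cpu]++;
}

/* Drop a reference, possibly on a different CPU than the hold. */
static void dev_put_sketch(struct netdev_sketch *dev, unsigned cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    dev->pcpu_refcnt[cpu]--;
}

/* Slow path: the real count is the sum over all slots. */
static int netdev_refcnt_read_sketch(const struct netdev_sketch *dev)
{
    int sum = 0;

    for (unsigned cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += dev->pcpu_refcnt[cpu];
    return sum;
}
```

An individual slot may go negative when a reference is taken on one CPU and released on another; only the sum is meaningful, which is the trade-off for making the fast path cheap.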
  11. 12 Oct, 2010 4 commits
    • net dst: use a percpu_counter to track entries · fc66f95c
      Eric Dumazet committed
      struct dst_ops tracks number of allocated dst in an atomic_t field,
      subject to high cache line contention in stress workload.
      
      Switch to a percpu_counter, to reduce the number of times we need to dirty a
      central location. Place it on a separate cache line to avoid dirtying
      read-only fields.
      
      Stress test :
      
      (Sending 160.000.000 UDP frames,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_TRIE, SLUB/NUMA)
      
      Before:
      
      real    0m51.179s
      user    0m15.329s
      sys     10m15.942s
      
      After:
      
      real	0m45.570s
      user	0m15.525s
      sys	9m56.669s
      
      With a small reordering of struct neighbour fields, subject of a
      following patch, (to separate refcnt from other read mostly fields)
      
      real	0m41.841s
      user	0m15.261s
      sys	8m45.949s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc66f95c
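The batching idea behind a per-CPU counter can be sketched as below. This is a single-threaded illustration with hypothetical names (`pcounter`, `pcounter_add`, `pcounter_read_fast`, `pcounter_sum`) and an assumed 4-slot array; the kernel's percpu_counter additionally protects the shared count with a lock when folding. Each CPU accumulates into its own slot and only touches the shared count once the local delta reaches the batch size.

```c
#include <assert.h>
#include <stdint.h>

#define PC_NR_CPUS 4   /* assumption: small fixed CPU count for the sketch */

/* Hypothetical simplified counter; not the kernel's struct percpu_counter. */
struct pcounter {
    long    count;               /* shared, approximate total */
    int32_t local[PC_NR_CPUS];   /* per-CPU deltas not yet folded in */
    int32_t batch;               /* fold threshold */
};

static void pcounter_init(struct pcounter *c, int32_t batch)
{
    assert(batch > 0);
    c->count = 0;
    c->batch = batch;
    for (unsigned i = 0; i < PC_NR_CPUS; i++)
        c->local[i] = 0;
}

/* Add to the local slot; dirty the shared count only past the threshold. */
static void pcounter_add(struct pcounter *c, unsigned cpu, int32_t amount)
{
    int32_t v;

    assert(cpu < PC_NR_CPUS);
    v = c->local[cpu] + amount;
    if (v >= c->batch || v <= -c->batch) {
        c->count += v;           /* rare: write to the shared location */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;       /* common: CPU-local write only */
    }
}

/* Cheap, possibly stale read of the shared count. */
static long pcounter_read_fast(const struct pcounter *c)
{
    return c->count;
}

/* Exact read: shared count plus every unfolded local delta. */
static long pcounter_sum(const struct pcounter *c)
{
    long sum = c->count;

    for (unsigned i = 0; i < PC_NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

The fast read can lag the exact value by up to roughly the batch size per CPU, which is acceptable when the count only drives a threshold check.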
    • neigh: Protect neigh->ha[] with a seqlock · 0ed8ddf4
      Eric Dumazet committed
      Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
      dirtying neighbour in stress situation (many different flows / dsts)
      
      Dirtying takes place because of read_lock(&n->lock) and n->used writes.
      
      Switching to a seqlock, and writing n->used only on jiffies changes
      permits less dirtying.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ed8ddf4
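The sequence-counter read/retry pattern that this commit relies on can be sketched in C11 as follows. It is a userspace approximation with hypothetical names (`neigh_sketch`, `ha_write`, `ha_read`, `touch_used`), it assumes writers are already serialised by some other lock, and it does not reproduce the kernel's seqlock API. Readers never write to the shared structure; they only re-read when the sequence number shows a concurrent update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define HA_LEN 6   /* assumption: Ethernet-sized hardware address */

/* Hypothetical stand-in for the relevant part of struct neighbour. */
struct neigh_sketch {
    atomic_uint          seq;         /* even: stable, odd: write in progress */
    _Atomic uint8_t      ha[HA_LEN];  /* hardware address */
    _Atomic unsigned long used;       /* last-use timestamp (jiffies-like) */
};

static void neigh_sketch_init(struct neigh_sketch *n)
{
    atomic_init(&n->seq, 0);
    atomic_init(&n->used, 0);
    for (int i = 0; i < HA_LEN; i++)
        atomic_init(&n->ha[i], 0);
}

/* Writer side; assumes writers are serialised externally. */
static void ha_write(struct neigh_sketch *n, const uint8_t *addr)
{
    unsigned s = atomic_load_explicit(&n->seq, memory_order_relaxed);

    atomic_store_explicit(&n->seq, s + 1, memory_order_relaxed);   /* odd */
    atomic_thread_fence(memory_order_release);
    for (int i = 0; i < HA_LEN; i++)
        atomic_store_explicit(&n->ha[i], addr[i], memory_order_relaxed);
    atomic_store_explicit(&n->seq, s + 2, memory_order_release);   /* even */
}

/* Reader side: no stores to *n, retry if a write raced with the copy. */
static void ha_read(struct neigh_sketch *n, uint8_t *out)
{
    unsigned s1, s2;

    do {
        s1 = atomic_load_explicit(&n->seq, memory_order_acquire);
        for (int i = 0; i < HA_LEN; i++)
            out[i] = atomic_load_explicit(&n->ha[i], memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&n->seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);
}

/* Write the timestamp only when it changes. Returns 1 if a store happened. */
static int touch_used(struct neigh_sketch *n, unsigned long now)
{
    if (atomic_load_explicit(&n->used, memory_order_relaxed) == now)
        return 0;
    atomic_store_explicit(&n->used, now, memory_order_relaxed);
    return 1;
}
```

Compared with a reader/writer lock, the reader path here performs no atomic read-modify-write on the shared cache line, and `touch_used()` skips the store when the timestamp has not advanced.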
    • net: clear heap allocations for privileged ethtool actions · b00916b1
      Kees Cook committed
      Several other ethtool functions leave heap uncleared (potentially) by
      drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
      are well controlled. In some situations (e.g. unchecked error conditions),
      the heap will remain unchanged in areas before copying back to userspace.
      Note that these are less of an issue since these all require CAP_NET_ADMIN.
      
      Cc: stable@kernel.org
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Acked-by: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b00916b1
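The class of problem addressed here can be illustrated in userspace with a zeroing allocator. The sketch below is an analogy only: `fake_get_regs` and `dump_regs` are hypothetical, `calloc()` stands in for a zeroing kernel allocation, and returning the buffer stands in for copying it back to userspace. It shows that when a driver callback fails part-way and the error goes unchecked, the untouched tail of the buffer contains zeros rather than stale heap contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical driver callback: fills the whole buffer on success, or
 * bails out after writing only the first half when 'fail' is set.
 */
static int fake_get_regs(unsigned char *buf, size_t len, int fail)
{
    size_t filled = fail ? len / 2 : len;

    memset(buf, 0xAB, filled);
    return fail ? -1 : 0;
}

/*
 * Allocate a zeroed buffer, let the driver fill it, and hand the whole
 * buffer back to the caller. The driver's return value is deliberately
 * ignored to model the unchecked error condition from the text.
 */
static unsigned char *dump_regs(size_t len, int fail)
{
    unsigned char *buf;

    assert(len > 0);
    buf = calloc(1, len);        /* zeroing allocation instead of malloc() */
    if (!buf)
        return NULL;
    (void)fake_get_regs(buf, len, fail);
    return buf;                  /* caller owns and must free() it */
}
```

Had the buffer come from `malloc()`, the bytes past the half-way mark would be whatever the heap previously held.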
    • neigh: speedup neigh_hh_init() · 34d101dd
      Eric Dumazet committed
      When a new dst is used to send a frame, neigh_resolve_output() tries to
      associate a struct hh_cache to this dst, calling neigh_hh_init() with
      the neigh rwlock write locked.
      
      Most of the time, hh_cache is already known and linked into neighbour,
      so we find it and increment its refcount.
      
      This patch changes the logic so that we call neigh_hh_init() with
      neighbour lock read locked only, so that fast path can be run in
      parallel by concurrent cpus.
      
      This brings part of the speedup we got with commit c7d4426a
      (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
      removing one of the contention points that routers hit on multiqueue
      enabled machines.
      
      Further improvements would need to use a seqlock instead of an rwlock to
      protect neigh->ha[], to not dirty neigh too often and remove two atomic
      ops.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34d101dd
  12. 09 Oct, 2010 3 commits
  13. 08 Oct, 2010 1 commit
    • net: suppress RCU lockdep false positive in sock_update_classid · 1144182a
      Paul E. McKenney committed
      > ===================================================
      > [ INFO: suspicious rcu_dereference_check() usage. ]
      > ---------------------------------------------------
      > include/linux/cgroup.h:542 invoked rcu_dereference_check() without protection!
      >
      > other info that might help us debug this:
      >
      >
      > rcu_scheduler_active = 1, debug_locks = 0
      > 1 lock held by swapper/1:
      >  #0:  (net_mutex){+.+.+.}, at: [<ffffffff813e9010>]
      > register_pernet_subsys+0x1f/0x47
      >
      > stack backtrace:
      > Pid: 1, comm: swapper Not tainted 2.6.35.4-28.fc14.x86_64 #1
      > Call Trace:
      >  [<ffffffff8107bd3a>] lockdep_rcu_dereference+0xaa/0xb3
      >  [<ffffffff813e04b9>] sock_update_classid+0x7c/0xa2
      >  [<ffffffff813e054a>] sk_alloc+0x6b/0x77
      >  [<ffffffff8140b281>] __netlink_create+0x37/0xab
      >  [<ffffffff813f941c>] ? rtnetlink_rcv+0x0/0x2d
      >  [<ffffffff8140cee1>] netlink_kernel_create+0x74/0x19d
      >  [<ffffffff8149c3ca>] ? __mutex_lock_common+0x339/0x35b
      >  [<ffffffff813f7e9c>] rtnetlink_net_init+0x2e/0x48
      >  [<ffffffff813e8d7a>] ops_init+0xe9/0xff
      >  [<ffffffff813e8f0d>] register_pernet_operations+0xab/0x130
      >  [<ffffffff813e901f>] register_pernet_subsys+0x2e/0x47
      >  [<ffffffff81db7bca>] rtnetlink_init+0x53/0x102
      >  [<ffffffff81db835c>] netlink_proto_init+0x126/0x143
      >  [<ffffffff81db8236>] ? netlink_proto_init+0x0/0x143
      >  [<ffffffff810021b8>] do_one_initcall+0x72/0x186
      >  [<ffffffff81d78ebc>] kernel_init+0x23b/0x2c9
      >  [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
      >  [<ffffffff8149e2d0>] ? restore_args+0x0/0x30
      >  [<ffffffff81d78c81>] ? kernel_init+0x0/0x2c9
      >  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
      
      The sock_update_classid() function calls task_cls_classid(current),
      but the calling task cannot go away, so there is no danger of
      the associated structures disappearing.  Insert an RCU read-side
      critical section to suppress the false positive.
      Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1144182a
  14. 07 Oct, 2010 2 commits
  15. 06 Oct, 2010 2 commits
    • fib: RCU conversion of fib_lookup() · ebc0ffae
      Eric Dumazet committed
      fib_lookup() converted to be called in RCU protected context, no
      reference taken and released on a contended cache line (fib_clntref)
      
      fib_table_lookup() and fib_semantic_match() get an additional parameter.
      
      struct fib_info gets an rcu_head field, and is freed after an rcu grace
      period.
      
      Stress test :
      (Sending 160.000.000 UDP frames on same neighbour,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_HASH) (about same results for FIB_TRIE)
      
      Before patch :
      
      real	1m31.199s
      user	0m13.761s
      sys	23m24.780s
      
      After patch:
      
      real	1m5.375s
      user	0m14.997s
      sys	15m50.115s
      
      Before patch Profile :
      
      13044.00 15.4% __ip_route_output_key vmlinux
       8438.00 10.0% dst_destroy           vmlinux
       5983.00  7.1% fib_semantic_match    vmlinux
       5410.00  6.4% fib_rules_lookup      vmlinux
       4803.00  5.7% neigh_lookup          vmlinux
       4420.00  5.2% _raw_spin_lock        vmlinux
       3883.00  4.6% rt_set_nexthop        vmlinux
       3261.00  3.9% _raw_read_lock        vmlinux
       2794.00  3.3% fib_table_lookup      vmlinux
       2374.00  2.8% neigh_resolve_output  vmlinux
       2153.00  2.5% dst_alloc             vmlinux
       1502.00  1.8% _raw_read_lock_bh     vmlinux
       1484.00  1.8% kmem_cache_alloc      vmlinux
       1407.00  1.7% eth_header            vmlinux
       1406.00  1.7% ipv4_dst_destroy      vmlinux
       1298.00  1.5% __copy_from_user_ll   vmlinux
       1174.00  1.4% dev_queue_xmit        vmlinux
       1000.00  1.2% ip_output             vmlinux
      
      After patch Profile :
      
      13712.00 15.8% dst_destroy             vmlinux
       8548.00  9.9% __ip_route_output_key   vmlinux
       7017.00  8.1% neigh_lookup            vmlinux
       4554.00  5.3% fib_semantic_match      vmlinux
       4067.00  4.7% _raw_read_lock          vmlinux
       3491.00  4.0% dst_alloc               vmlinux
       3186.00  3.7% neigh_resolve_output    vmlinux
       3103.00  3.6% fib_table_lookup        vmlinux
       2098.00  2.4% _raw_read_lock_bh       vmlinux
       2081.00  2.4% kmem_cache_alloc        vmlinux
       2013.00  2.3% _raw_spin_lock          vmlinux
       1763.00  2.0% __copy_from_user_ll     vmlinux
       1763.00  2.0% ip_output               vmlinux
       1761.00  2.0% ipv4_dst_destroy        vmlinux
       1631.00  1.9% eth_header              vmlinux
       1440.00  1.7% _raw_read_unlock_bh     vmlinux
      
      Reference results, if IP route cache is enabled :
      
      real	0m29.718s
      user	0m10.845s
      sys	7m37.341s
      
      25213.00 29.5% __ip_route_output_key   vmlinux
       9011.00 10.5% dst_release             vmlinux
       4817.00  5.6% ip_push_pending_frames  vmlinux
       4232.00  5.0% ip_finish_output        vmlinux
       3940.00  4.6% udp_sendmsg             vmlinux
       3730.00  4.4% __copy_from_user_ll     vmlinux
       3716.00  4.4% ip_route_output_flow    vmlinux
       2451.00  2.9% __xfrm_lookup           vmlinux
       2221.00  2.6% ip_append_data          vmlinux
       1718.00  2.0% _raw_spin_lock_bh       vmlinux
       1655.00  1.9% __alloc_skb             vmlinux
       1572.00  1.8% sock_wfree              vmlinux
       1345.00  1.6% kfree                   vmlinux
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebc0ffae
    • net neigh: RCU conversion of neigh hash table · d6bf7817
      Eric Dumazet committed
      David
      
      This is the first step for RCU conversion of neigh code.
      
      Next patches will convert hash_buckets[] and "struct neighbour" to RCU
      protected objects.
      
      Thanks
      
      [PATCH net-next] net neigh: RCU conversion of neigh hash table
      
      Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
      neigh_table", a new structure is defined :
      
      struct neigh_hash_table {
             struct neighbour        **hash_buckets;
             unsigned int            hash_mask;
             __u32                   hash_rnd;
             struct rcu_head         rcu;
      };
      
      And "struct neigh_table" has an RCU protected pointer to such a
      neigh_hash_table.
      
      This means the signature of the (*hash)() function changed: we need to add a
      third parameter with the actual hash_rnd value, since this is no
      longer a neigh_table field.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6bf7817
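The layout change described in this commit can be approximated in userspace C11 as below. All names ending in `_sketch`, plus `demo_hash`, `nht_alloc`, `nht_replace`, `nht_free` and `bucket_of`, are hypothetical; the hash callback here takes only a key and `hash_rnd` rather than the kernel's full argument list, an atomic pointer stands in for the RCU-protected pointer, and rehashing of existing entries is omitted. It shows the hash parameters travelling together in one object that is swapped as a unit, and `hash_rnd` being passed to the hash function because it is no longer a field of the table itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct neighbour_sketch;   /* opaque placeholder for struct neighbour */

/* Mirrors the fields listed in the commit message, minus the rcu_head. */
struct nht_sketch {
    struct neighbour_sketch **hash_buckets;
    unsigned int              hash_mask;
    uint32_t                  hash_rnd;
};

/* Hypothetical table: holds a pointer to the current hash table object. */
struct neigh_table_sketch {
    uint32_t (*hash)(uint32_t key, uint32_t hash_rnd);
    _Atomic(struct nht_sketch *) nht;
};

/* Illustrative hash only; hash_rnd arrives as an explicit parameter. */
static uint32_t demo_hash(uint32_t key, uint32_t hash_rnd)
{
    return (key ^ hash_rnd) * 2654435761u;
}

/* Allocate a table with a power-of-two number of empty buckets. */
static struct nht_sketch *nht_alloc(unsigned int entries, uint32_t rnd)
{
    struct nht_sketch *t;

    assert(entries != 0 && (entries & (entries - 1)) == 0);
    t = malloc(sizeof(*t));
    if (!t)
        return NULL;
    t->hash_buckets = calloc(entries, sizeof(*t->hash_buckets));
    if (!t->hash_buckets) {
        free(t);
        return NULL;
    }
    t->hash_mask = entries - 1;
    t->hash_rnd = rnd;
    return t;
}

static void nht_free(struct nht_sketch *t)
{
    free(t->hash_buckets);
    free(t);
}

/* Reader: load the pointer once, so mask and rnd come from one snapshot. */
static unsigned int bucket_of(struct neigh_table_sketch *tbl, uint32_t key)
{
    struct nht_sketch *t =
        atomic_load_explicit(&tbl->nht, memory_order_acquire);

    return tbl->hash(key, t->hash_rnd) & t->hash_mask;
}

/* Publish a new table and return the old one for deferred freeing. */
static struct nht_sketch *nht_replace(struct neigh_table_sketch *tbl,
                                      struct nht_sketch *new_nht)
{
    return atomic_exchange_explicit(&tbl->nht, new_nht, memory_order_acq_rel);
}
```

In the kernel the old table would be freed only after an RCU grace period; this sketch frees it immediately in its caller, which is safe only because there are no concurrent readers.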