1. 04 10月, 2010 4 次提交
  2. 01 10月, 2010 3 次提交
    • E
      ipv4: rcu conversion in ip_route_output_slow · 0197aa38
      Eric Dumazet 提交于
      ip_route_output_slow() is enclosed in an rcu_read_lock() protected
      section, so that no references are taken/released on device, thanks to
      __ip_dev_find() & dev_get_by_index_rcu()
      
      Tested with ip route cache disabled, and a stress test :
      
      Before patch:
      
      elapsed time :
      
      real	1m38.347s
      user	0m11.909s
      sys	23m51.501s
      
      Profile:
      
      13788.00 22.7% ip_route_output_slow [kernel]
       7875.00 13.0% dst_destroy          [kernel]
       3925.00  6.5% fib_semantic_match   [kernel]
       3144.00  5.2% fib_rules_lookup     [kernel]
       3061.00  5.0% dst_alloc            [kernel]
       2276.00  3.7% rt_set_nexthop       [kernel]
       1762.00  2.9% fib_table_lookup     [kernel]
       1538.00  2.5% _raw_read_lock       [kernel]
       1358.00  2.2% ip_output            [kernel]
      
      After patch:
      
      real	1m28.808s
      user	0m13.245s
      sys	20m37.293s
      
      10950.00 17.2% ip_route_output_slow [kernel]
      10726.00 16.9% dst_destroy          [kernel]
       5170.00  8.1% fib_semantic_match   [kernel]
       3937.00  6.2% dst_alloc            [kernel]
       3635.00  5.7% rt_set_nexthop       [kernel]
       2900.00  4.6% fib_rules_lookup     [kernel]
       2240.00  3.5% fib_table_lookup     [kernel]
       1427.00  2.2% _raw_read_lock       [kernel]
       1157.00  1.8% kmem_cache_alloc     [kernel]
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0197aa38
    • E
      ipv4: introduce __ip_dev_find() · 82efee14
      Eric Dumazet 提交于
      ip_dev_find(net, addr) finds a device given an IPv4 source address and
      takes a reference on it.
      
      Introduce __ip_dev_find(), taking a third argument, to optionally take
      the device reference. Callers not asking the reference to be taken
      should be in an rcu_read_lock() protected section.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82efee14
    • E
      ipv4: __mkroute_output() speedup · dd28d1a0
      Eric Dumazet 提交于
      While doing stress tests with a disabled IP route cache, I found
      __mkroute_output() was touching three times in_device atomic refcount.
      
      Use RCU to touch it once to reduce cache line ping pongs.
      
      Before patch
      
      time to perform the test
      real	1m42.009s
      user	0m12.545s
      sys	25m0.726s
      
      Profile :
      
      16109.00 26.4% ip_route_output_slow   vmlinux
       7434.00 12.2% dst_destroy            vmlinux
       3280.00  5.4% fib_rules_lookup       vmlinux
       3252.00  5.3% fib_semantic_match     vmlinux
       2622.00  4.3% fib_table_lookup       vmlinux
       2535.00  4.1% dst_alloc              vmlinux
       1750.00  2.9% _raw_read_lock         vmlinux
       1532.00  2.5% rt_set_nexthop         vmlinux
      
      After patch
      
      real	1m36.503s
      user	0m12.977s
      sys	23m25.608s
      
      14234.00 22.4% ip_route_output_slow   vmlinux
       8717.00 13.7% dst_destroy            vmlinux
       4052.00  6.4% fib_rules_lookup       vmlinux
       3951.00  6.2% fib_semantic_match     vmlinux
       3191.00  5.0% dst_alloc              vmlinux
       1764.00  2.8% fib_table_lookup       vmlinux
       1692.00  2.7% _raw_read_lock         vmlinux
       1605.00  2.5% rt_set_nexthop         vmlinux
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd28d1a0
  3. 30 9月, 2010 6 次提交
  4. 29 9月, 2010 1 次提交
    • T
      ipv4: Allow configuring subnets as local addresses · 4465b469
      Tom Herbert 提交于
      This patch allows a host to be configured to respond to any address in
      a specified range as if it were local, without actually needing to
      configure the address on an interface.  This is done through routing
      table configuration.  For instance, to configure a host to respond
      to any address in 10.1/16 received on eth0 as a local address we can do:
      
      ip rule add from all iif eth0 lookup 200
      ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200
      
      This host is now reachable by any 10.1/16 address (route lookup on
      input for packets received on eth0 can find the route).  On output, the
      rule will not be matched so that this host can still send packets to
      10.1/16 (not sent on loopback).  Presumably, external routing can be
      configured to make sense out of this.
      
      To make this work, we needed to modify the logic in finding the
      interface which is assigned a given source address for output
      (dev_ip_find).  We perform a normal fib_lookup instead of just a
      lookup on the local table, and in the lookup we ignore the input
      interface for matching.
      
      This patch is useful to implement IP-anycast for subnets of virtual
      addresses.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4465b469
  5. 28 9月, 2010 2 次提交
    • E
      ipip: percpu stats accounting · 3c97af99
      Eric Dumazet 提交于
      Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
      
      Other seldom used fields are kept in netdev->stats structure, possibly
      unsafe.
      
      This is a preliminary work to support lockless transmit path, and
      correct RX stats, that are already unsafe.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c97af99
    • E
      ip_gre: percpu stats accounting · e985aad7
      Eric Dumazet 提交于
      Le lundi 27 septembre 2010 à 14:29 +0100, Ben Hutchings a écrit :
      
      > > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
      > > index 5d6ddcb..de39b22 100644
      > > --- a/net/ipv4/ip_gre.c
      > > +++ b/net/ipv4/ip_gre.c
      > [...]
      > > @@ -377,7 +405,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net,
      > >  	if (parms->name[0])
      > >  		strlcpy(name, parms->name, IFNAMSIZ);
      > >  	else
      > > -		sprintf(name, "gre%%d");
      > > +		strcpy(name, "gre%d");
      > >
      > >  	dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
      > >  	if (!dev)
      > [...]
      >
      > This is a valid fix, but doesn't belong in this patch!
      >
      
      Sorry ? It was not a fix, but at most a cleanup ;)
      
      Anyway I forgot the gretap case...
      
      [PATCH 2/4 v2] ip_gre: percpu stats accounting
      
      Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
      
      Other seldom used fields are kept in netdev->stats structure, possibly
      unsafe.
      
      This is a preliminary work to support lockless transmit path, and
      correct RX stats, that are already unsafe.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e985aad7
  6. 27 9月, 2010 1 次提交
  7. 25 9月, 2010 1 次提交
    • E
      ip: take care of last fragment in ip_append_data · 59104f06
      Eric Dumazet 提交于
      While investigating a bit, I found ip_fragment() slow path was taken
      because ip_append_data() provides following layout for a send(MTU +
      N*(MTU - 20)) syscall :
      
      - one skb with 1500 (mtu) bytes
      - N fragments of 1480 (mtu-20) bytes (before adding IP header)
      last fragment gets 17 bytes of trail data because of following bit:
      
      	if (datalen == length + fraggap)
      		alloclen += rt->dst.trailer_len;
      
      Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
      another bug ?)
      
      In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
      so we take slow path, building another skb chain.
      
      In order to avoid taking slow path, we should correct ip_append_data()
      to make sure last fragment has real trail space, under mtu...
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59104f06
  8. 24 9月, 2010 1 次提交
  9. 23 9月, 2010 4 次提交
  10. 22 9月, 2010 1 次提交
    • E
      ip: fix truesize mismatch in ip fragmentation · 3d13008e
      Eric Dumazet 提交于
      Special care should be taken when slow path is hit in ip_fragment() :
      
      When walking through frags, we transfert truesize ownership from skb to
      frags. Then if we hit a slow_path condition, we must undo this or risk
      uncharging frags->truesize twice, and in the end, having negative socket
      sk_wmem_alloc counter, or even freeing socket sooner than expected.
      
      Many thanks to Nick Bowler, who provided a very clean bug report and
      test program.
      
      Thanks to Jarek for reviewing my first patch and providing a V2
      
      While Nick bisection pointed to commit 2b85a34e (net: No more
      expensive sock_hold()/sock_put() on each tx), underlying bug is older
      (2.6.12-rc5)
      
      A side effect is to extend work done in commit b2722b1c
      (ip_fragment: also adjust skb->truesize for packets not owned by a
      socket) to ipv6 as well.
      Reported-and-bisected-by: NNick Bowler <nbowler@elliptictech.com>
      Tested-by: NNick Bowler <nbowler@elliptictech.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d13008e
  11. 21 9月, 2010 4 次提交
  12. 20 9月, 2010 1 次提交
  13. 16 9月, 2010 3 次提交
  14. 14 9月, 2010 2 次提交
    • M
      ipv4: enable getsockopt() for IP_NODEFRAG · a89b4763
      Michael Kerrisk 提交于
      While integrating your man-pages patch for IP_NODEFRAG, I noticed
      that this option is settable by setsockopt(), but not gettable by
      getsockopt(). I suppose this is not intended. The (untested,
      trivial) patch below adds getsockopt() support.
      Signed-off-by: NMichael kerrisk <mtk.manpages@gmail.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a89b4763
    • B
      ipv4: force_igmp_version ignored when a IGMPv3 query received · 79981563
      Bob Arendt 提交于
      After all these years, it turns out that the
          /proc/sys/net/ipv4/conf/*/force_igmp_version
      parameter isn't fully implemented.
      
      *Symptom*:
      When set force_igmp_version to a value of 2, the kernel should only perform
      multicast IGMPv2 operations (IETF rfc2236).  An host-initiated Join message
      will be sent as a IGMPv2 Join message.  But if a IGMPv3 query message is
      received, the host responds with a IGMPv3 join message.  Per rfc3376 and
      rfc2236, a IGMPv2 host should treat a IGMPv3 query as a IGMPv2 query and
      respond with an IGMPv2 Join message.
      
      *Consequences*:
      This is an issue when a IGMPv3 capable switch is the querier and will only
      issue IGMPv3 queries (which double as IGMPv2 querys) and there's an
      intermediate switch that is only IGMPv2 capable.  The intermediate switch
      processes the initial v2 Join, but fails to recognize the IGMPv3 Join responses
      to the Query, resulting in a dropped connection when the intermediate v2-only
      switch times it out.
      
      *Identifying issue in the kernel source*:
      The issue is in this section of code (in net/ipv4/igmp.c), which is called when
      an IGMP query is received  (from mainline 2.6.36-rc3 gitweb):
       ...
      A IGMPv3 query has a length >= 12 and no sources.  This routine will exit after
      line 880, setting the general query timer (random timeout between 0 and query
      response time).  This calls igmp_gq_timer_expire():
      ...
      .. which only sends a v3 response.  So if a v3 query is received, the kernel
      always sends a v3 response.
      
      IGMP queries happen once every 60 sec (per vlan), so the traffic is low.  A
      IGMPv3 query *is* a strict superset of a IGMPv2 query, so this patch properly
      short circuit's the v3 behaviour.
      
      One issue is that this does not address force_igmp_version=1.  Then again, I've
      never seen any IGMPv1 multicast equipment in the wild.  However there is a lot
      of v2-only equipment. If it's necessary to support the IGMPv1 case as well:
      
      837         if (len == 8 || IGMP_V2_SEEN(in_dev) || IGMP_V1_SEEN(in_dev)) {
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79981563
  15. 11 9月, 2010 1 次提交
  16. 10 9月, 2010 1 次提交
  17. 09 9月, 2010 4 次提交
    • E
      udp: add rehash on connect() · 719f8358
      Eric Dumazet 提交于
      commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
      added a secondary hash on UDP, hashed on (local addr, local port).
      
      Problem is that following sequence :
      
      fd = socket(...)
      connect(fd, &remote, ...)
      
      not only selects remote end point (address and port), but also sets
      local address, while UDP stack stored in secondary hash table the socket
      while its local address was INADDR_ANY (or ipv6 equivalent)
      
      Sequence is :
       - autobind() : choose a random local port, insert socket in hash tables
                    [while local address is INADDR_ANY]
       - connect() : set remote address and port, change local address to IP
                    given by a route lookup.
      
      When an incoming UDP frame comes, if more than 10 sockets are found in
      primary hash table, we switch to secondary table, and fail to find
      socket because its local address changed.
      
      One solution to this problem is to rehash datagram socket if needed.
      
      We add a new rehash(struct socket *) method in "struct proto", and
      implement this method for UDP v4 & v6, using a common helper.
      
      This rehashing only takes care of secondary hash table, since primary
      hash (based on local port only) is not changed.
      Reported-by: NKrzysztof Piotr Oledzki <ole@ans.pl>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Tested-by: NKrzysztof Piotr Oledzki <ole@ans.pl>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      719f8358
    • E
      net: inet_add_protocol() can use cmpxchg() · e0386005
      Eric Dumazet 提交于
      Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
      friends.
      
      inet_protos[] & inet6_protos[] are moved to read_mostly section
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0386005
    • J
      net: blackhole route should always be recalculated · ae2688d5
      Jianzhao Wang 提交于
      Blackhole routes are used when xfrm_lookup() returns -EREMOTE (error
      triggered by IKE for example), hence this kind of route is always
      temporary and so we should check if a better route exists for next
      packets.
      Bug has been introduced by commit d11a4dc1.
      Signed-off-by: NJianzhao Wang <jianzhao.wang@6wind.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae2688d5
    • J
      ipv4: Suppress lockdep-RCU false positive in FIB trie (3) · f6b085b6
      Jarek Poplawski 提交于
      Hi,
      Here is one more of these warnings and a patch below:
      
      Sep  5 23:52:33 del kernel: [46044.244833] ===================================================
      Sep  5 23:52:33 del kernel: [46044.269681] [ INFO: suspicious rcu_dereference_check() usage. ]
      Sep  5 23:52:33 del kernel: [46044.277000] ---------------------------------------------------
      Sep  5 23:52:33 del kernel: [46044.285185] net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!
      Sep  5 23:52:33 del kernel: [46044.293627]
      Sep  5 23:52:33 del kernel: [46044.293632] other info that might help us debug this:
      Sep  5 23:52:33 del kernel: [46044.293634]
      Sep  5 23:52:33 del kernel: [46044.325333]
      Sep  5 23:52:33 del kernel: [46044.325335] rcu_scheduler_active = 1, debug_locks = 0
      Sep  5 23:52:33 del kernel: [46044.348013] 1 lock held by pppd/1717:
      Sep  5 23:52:33 del kernel: [46044.357548]  #0:  (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20
      Sep  5 23:52:33 del kernel: [46044.367647]
      Sep  5 23:52:33 del kernel: [46044.367652] stack backtrace:
      Sep  5 23:52:33 del kernel: [46044.387429] Pid: 1717, comm: pppd Not tainted 2.6.35.4.4a #3
      Sep  5 23:52:33 del kernel: [46044.398764] Call Trace:
      Sep  5 23:52:33 del kernel: [46044.409596]  [<c12f9aba>] ? printk+0x18/0x1e
      Sep  5 23:52:33 del kernel: [46044.420761]  [<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
      Sep  5 23:52:33 del kernel: [46044.432229]  [<c12b7235>] trie_firstleaf+0x65/0x70
      Sep  5 23:52:33 del kernel: [46044.443941]  [<c12b74d4>] fib_table_flush+0x14/0x170
      Sep  5 23:52:33 del kernel: [46044.455823]  [<c1033e92>] ? local_bh_enable_ip+0x62/0xd0
      Sep  5 23:52:33 del kernel: [46044.467995]  [<c12fc39f>] ? _raw_spin_unlock_bh+0x2f/0x40
      Sep  5 23:52:33 del kernel: [46044.480404]  [<c12b24d0>] ? fib_sync_down_dev+0x120/0x180
      Sep  5 23:52:33 del kernel: [46044.493025]  [<c12b069d>] fib_flush+0x2d/0x60
      Sep  5 23:52:33 del kernel: [46044.505796]  [<c12b06f5>] fib_disable_ip+0x25/0x50
      Sep  5 23:52:33 del kernel: [46044.518772]  [<c12b10d3>] fib_netdev_event+0x73/0xd0
      Sep  5 23:52:33 del kernel: [46044.531918]  [<c1048dfd>] notifier_call_chain+0x2d/0x70
      Sep  5 23:52:33 del kernel: [46044.545358]  [<c1048f0a>] raw_notifier_call_chain+0x1a/0x20
      Sep  5 23:52:33 del kernel: [46044.559092]  [<c124f687>] call_netdevice_notifiers+0x27/0x60
      Sep  5 23:52:33 del kernel: [46044.573037]  [<c124faec>] __dev_notify_flags+0x5c/0x80
      Sep  5 23:52:33 del kernel: [46044.586489]  [<c124fb47>] dev_change_flags+0x37/0x60
      Sep  5 23:52:33 del kernel: [46044.599394]  [<c12a8a8d>] devinet_ioctl+0x54d/0x630
      Sep  5 23:52:33 del kernel: [46044.612277]  [<c12aabb7>] inet_ioctl+0x97/0xc0
      Sep  5 23:52:34 del kernel: [46044.625208]  [<c123f6af>] sock_ioctl+0x6f/0x270
      Sep  5 23:52:34 del kernel: [46044.638046]  [<c109d2b0>] ? handle_mm_fault+0x420/0x6c0
      Sep  5 23:52:34 del kernel: [46044.650968]  [<c123f640>] ? sock_ioctl+0x0/0x270
      Sep  5 23:52:34 del kernel: [46044.663865]  [<c10c3188>] vfs_ioctl+0x28/0xa0
      Sep  5 23:52:34 del kernel: [46044.676556]  [<c10c38fa>] do_vfs_ioctl+0x6a/0x5c0
      Sep  5 23:52:34 del kernel: [46044.688989]  [<c1048676>] ? up_read+0x16/0x30
      Sep  5 23:52:34 del kernel: [46044.701411]  [<c1021376>] ? do_page_fault+0x1d6/0x3a0
      Sep  5 23:52:34 del kernel: [46044.714223]  [<c10b6588>] ? fget_light+0xf8/0x2f0
      Sep  5 23:52:34 del kernel: [46044.726601]  [<c1241f98>] ? sys_socketcall+0x208/0x2c0
      Sep  5 23:52:34 del kernel: [46044.739140]  [<c10c3eb3>] sys_ioctl+0x63/0x70
      Sep  5 23:52:34 del kernel: [46044.751967]  [<c12fca3d>] syscall_call+0x7/0xb
      Sep  5 23:52:34 del kernel: [46044.764734]  [<c12f0000>] ? cookie_v6_check+0x3d0/0x630
      
      -------------->
      
      This patch fixes the warning:
       ===================================================
       [ INFO: suspicious rcu_dereference_check() usage. ]
       ---------------------------------------------------
       net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 0
       1 lock held by pppd/1717:
        #0:  (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20
      
       stack backtrace:
       Pid: 1717, comm: pppd Not tainted 2.6.35.4a #3
       Call Trace:
        [<c12f9aba>] ? printk+0x18/0x1e
        [<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
        [<c12b7235>] trie_firstleaf+0x65/0x70
        [<c12b74d4>] fib_table_flush+0x14/0x170
        ...
      
      Allow trie_firstleaf() to be called either under rcu_read_lock()
      protection or with RTNL held. The same annotation is added to
      node_parent_rcu() to prevent a similar warning a bit later.
      
      Followup of commits 634a4b20 and 4eaa0e3c.
      Signed-off-by: NJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6b085b6