1. 27 Jan 2011, 1 commit
    • net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      Committed by David S. Miller
      Routing metrics are now copy-on-write.
      
      Initially a route entry points its metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writability state of the metrics is stored in the low bits of the
      metrics pointer; we have two bits left to spare if we want to store
      more states.
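
      A minimal sketch of the low-bit tagging this describes (the flag and
      helper names below are illustrative, not necessarily the patch's own):

          /* _metrics is an unsigned long; the pointer it carries is
           * word-aligned, freeing its low bits to hold state. */
          #define DST_METRICS_READ_ONLY  0x1UL
          #define DST_METRICS_FLAGS      0x3UL

          static inline bool dst_metrics_read_only(const struct dst_entry *dst)
          {
                  return dst->_metrics & DST_METRICS_READ_ONLY;
          }

          static inline u32 *dst_metrics_ptr(struct dst_entry *dst)
          {
                  return (u32 *)(dst->_metrics & ~DST_METRICS_FLAGS);
          }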
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writable cache line.  This is OK because we always access the
      flags around the same moment that we modify the reference count.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Jan 2011, 1 commit
    • netfilter: fix Kconfig dependencies · c7066f70
      Committed by Patrick McHardy
      Fix the dependencies of the netfilter realm match: it depends on
      NET_CLS_ROUTE, which itself depends on NET_SCHED; this dependency is
      missing from netfilter.
      
      Since matching on realms is also useful without NET_SCHED enabled, and
      the option really only controls whether the tclassid member is included
      in route and dst entries, rename the config option to IP_ROUTE_CLASSID
      and move it out of the traffic-scheduling context to get rid of the
      NET_SCHED dependency.
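
      A rough sketch of the resulting Kconfig shape (simplified; the real
      entries carry more dependencies and help text):

          config IP_ROUTE_CLASSID
                  bool

          config NETFILTER_XT_MATCH_REALM
                  tristate '"realm" match support'
                  select IP_ROUTE_CLASSID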
      Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
  3. 15 Dec 2010, 1 commit
  4. 14 Dec 2010, 1 commit
    • net: Abstract default ADVMSS behind an accessor. · 0dbaee3b
      Committed by David S. Miller
      Make all RTAX_ADVMSS metric accesses go through a new helper function,
      dst_metric_advmss().
      
      Leave the actual default metric as "zero" in the real metric slot,
      and compute the actual default value dynamically via a new
      AF-specific dst_ops callback.
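
      A sketch of the helper's shape (illustrative; the point is the
      fallback through the new AF-specific dst_ops callback):

          static inline u32 dst_metric_advmss(const struct dst_entry *dst)
          {
                  u32 advmss = dst_metric_raw(dst, RTAX_ADVMSS);

                  /* zero in the metric slot means "compute the default" */
                  if (!advmss)
                          advmss = dst->ops->default_advmss(dst);

                  return advmss;
          }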
      
      For stacked IPSEC routes, we use the advmss of the path which
      preserves existing behavior.
      
      Unlike ipv4/ipv6, DECnet ties the advmss to the mtu and thus updates
      advmss on pmtu updates.  This inconsistency in advmss handling
      results in more raw metric accesses than I wish we had ended up with.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 13 Dec 2010, 2 commits
    • ipv4: Don't pre-seed hoplimit metric. · 323e126f
      Committed by David S. Miller
      Always go through a new ip4_dst_hoplimit() helper, just like ipv6.
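
      A sketch of what such a helper might look like (illustrative; the
      fallback to the sysctl is the essential part):

          static inline int ip4_dst_hoplimit(const struct dst_entry *dst)
          {
                  int hoplimit = dst_metric_raw(dst, RTAX_HOPLIMIT);

                  /* no per-route value: fall back to the sysctl default */
                  if (hoplimit == 0)
                          hoplimit = sysctl_ip_default_ttl;
                  return hoplimit;
          }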
      
      This allowed several simplifications:
      
      1) The interim dst_metric_hoplimit() can go, as it's no longer
         used.
      
      2) The sysctl_ip_default_ttl entry no longer needs to use
         ipv4_doint_and_flush, since the sysctl is not cached in
         routing cache metrics any longer.
      
      3) ipv4_doint_and_flush no longer needs to be exported and
         therefore can be marked static.
      
      When ipv4_doint_and_flush_strategy was removed some time ago,
      the external declaration in ip.h was mistakenly left around,
      so kill that off too.
      
      We have to move the sysctl_ip_default_ttl declaration into ipv4's
      route cache definition header, net/route.h, because net/ip.h (where
      the declaration lives now) has a back dependency on net/route.h.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 5170ae82
  6. 10 Dec 2010, 1 commit
    • net: Abstract away all dst_entry metrics accesses. · defb3519
      Committed by David S. Miller
      Use helper functions to hide all direct accesses, especially writes,
      to dst_entry metrics values.
      
      This will allow us to:
      
      1) More easily change how the metrics are stored.
      
      2) Implement COW for metrics.
      
      In particular this will help us put metrics into the inetpeer
      cache if that is what we end up doing.  We can make the _metrics
      member a pointer instead of an array, initially have it point
      at the read-only metrics in the FIB, and then on the first set
      grab an inetpeer entry and point the _metrics member there.
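
      A sketch of the accessor pair (illustrative; the real patch adds a
      larger family of helpers):

          /* metric constants (RTAX_*) are 1-based, the array 0-based */
          static inline u32 dst_metric_raw(const struct dst_entry *dst,
                                           const int metric)
          {
                  return dst->_metrics[metric - 1];
          }

          static inline void dst_metric_set(struct dst_entry *dst,
                                            int metric, u32 val)
          {
                  dst->_metrics[metric - 1] = val;
          }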
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
  7. 09 Nov 2010, 1 commit
  8. 28 Oct 2010, 1 commit
  9. 04 Oct 2010, 1 commit
    • net: introduce DST_NOCACHE flag · c7d4426a
      Committed by Eric Dumazet
      While doing stress tests with the IP route cache disabled on
      multi-queue devices, I noticed very high contention on one rwlock
      used in the neighbour code.
      
      When many cpus try to send frames (possibly using a high-performance
      multiqueue device) to the same neighbour, they fight for the
      neigh->lock rwlock in order to call neigh_hh_init(), and fight on
      hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test()).
      
      But we don't need to call neigh_hh_init() for dsts that are used only
      once.  It costs at least four atomic operations on two contended cache
      lines, plus the high contention on the neigh->lock rwlock.
      
      Introduce a new dst flag, DST_NOCACHE, that is set when the dst was
      not inserted in the route cache.
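
      A sketch of how the flag would be consulted on output (flag value and
      surrounding code illustrative):

          #define DST_NOCACHE  0x0010

          /* only build the cached hardware header for dsts that
           * will actually be reused */
          if (dev->header_ops->cache && !(dst->flags & DST_NOCACHE))
                  neigh_hh_init(neigh, dst, dst->ops->protocol);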
      
      With the stress-test bench, sending 160,000,000 frames to one
      neighbour, the results are:
      
      Before patch:
      
      real	2m28.406s
      user	0m11.781s
      sys	36m17.964s
      
      
      After patch:
      
      real	1m26.532s
      user	0m12.185s
      sys	20m3.903s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 28 Sep 2010, 1 commit
  11. 27 Sep 2010, 1 commit
  12. 05 Jun 2010, 1 commit
  13. 18 May 2010, 2 commits
    • net: Introduce skb_tunnel_rx() helper · d19d56dd
      Committed by Eric Dumazet
      skb rxhash should be cleared when a skb is handled by a tunnel before
      being delivered again, so that correct packet steering can take place.

      There are other cleanups and accounting that we can factor into a new
      helper, skb_tunnel_rx().
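
      A sketch of what the factored-out helper might look like
      (illustrative; the real helper may reset more per-skb state):

          static inline void skb_tunnel_rx(struct sk_buff *skb,
                                           struct net_device *dev)
          {
                  /* account the packet to the tunnel device */
                  dev->stats.rx_packets++;
                  dev->stats.rx_bytes += skb->len;

                  skb->dev = dev;
                  skb->rxhash = 0;  /* force fresh packet steering */
                  nf_reset(skb);
          }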
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add a noref bit on skb dst · 7fee226a
      Committed by Eric Dumazet
      Use the low-order bit of skb->_skb_dst to tell that the dst is not
      refcounted.
      
      Change _skb_dst to _skb_refdst to make sure all uses are caught.
      
      skb_dst() returns the dst regardless of whether the noref bit is set,
      but with a lockdep check to make sure a noref dst is not handed out
      if the current user is not rcu protected.
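
      A sketch of the bit encoding (mask names illustrative):

          #define SKB_DST_NOREF   1UL
          #define SKB_DST_PTRMASK ~(SKB_DST_NOREF)

          static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
          {
                  /* lockdep check elided in this sketch */
                  return (struct dst_entry *)(skb->_skb_refdst &
                                              SKB_DST_PTRMASK);
          }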
      
      A new skb_dst_set_noref() helper sets a non-refcounted dst on a skb
      (with a lockdep check).
      
      skb_dst_drop() drops a reference only if the skb dst was refcounted.
      
      The skb_dst_force() helper is used to force a refcount on the dst
      when the skb is queued and no longer RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), in __dev_xmit_skb() (if
      !IFF_XMIT_DST_RELEASE or the skb is enqueued on a qdisc queue), in
      sock_queue_rcv_skb(), and in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties the dst; we might later change it
      to do one dirtying per jiffy.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 13 Apr 2010, 1 commit
    • net: sk_dst_cache RCUification · b6c6712a
      Committed by Eric Dumazet
      With the latest CONFIG_PROVE_RCU stuff, I felt comfortable enough to
      make this work.
      
      sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock).

      This rwlock is read-locked for a very small amount of time, and dst
      entries are already freed after an RCU grace period.  This calls for
      RCU again :)
      
      This patch converts sk_dst_lock to a spinlock and uses RCU for readers.
      
      __sk_dst_get() is supposed to be called with rcu_read_lock() held or
      with the socket locked by the user, so use the appropriate
      rcu_dereference_check() condition
      (rcu_read_lock_held() || sock_owned_by_user(sk)).
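
      A sketch of the resulting accessor (close to what the text describes;
      details illustrative):

          static inline struct dst_entry *__sk_dst_get(struct sock *sk)
          {
                  return rcu_dereference_check(sk->sk_dst_cache,
                                               rcu_read_lock_held() ||
                                               sock_owned_by_user(sk));
          }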
      
      This patch avoids two atomic ops per tx packet on connected UDP
      sockets, for example, and lets sk_dst_lock be dirtied much less.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 24 Dec 2009, 1 commit
    • net: Add rtnetlink init_rcvwnd to set the TCP initial receive window · 31d12926
      Committed by Laurent Chavey
      Add rtnetlink init_rcvwnd to set the TCP initial receive window size
      advertised by passive and active TCP connections.
      The current Linux TCP implementation limits the advertised TCP initial
      receive window to the one prescribed by slow start.  For short-lived
      TCP connections used for transaction-type traffic (i.e. http requests),
      bounding the advertised TCP initial receive window results in increased
      latency to complete the transaction.
      Setting the initial congestion window is already supported via
      rtnetlink init_cwnd, but that feature is of limited use without the
      ability to also set a larger TCP initial receive window.
      The rtnetlink init_rcvwnd allows increasing the TCP initial receive
      window, letting TCP connections advertise a larger receive window
      than the one bounded by slow start.
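
      A sketch of how the per-route value could be plumbed into window
      selection (argument list abridged and illustrative):

          /* RTAX_INITRWND carries the configured initial receive window */
          tcp_select_initial_window(tcp_full_space(sk), mss,
                                    &tp->rcv_wnd, &tp->window_clamp,
                                    sysctl_tcp_window_scaling, &rcv_wscale,
                                    dst_metric(dst, RTAX_INITRWND));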
      Signed-off-by: Laurent Chavey <chavey@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 16 Dec 2009, 1 commit
    • tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. · bb5b7c11
      Committed by David S. Miller
      It creates a regression, triggering badness for SYN_RECV
      sockets, for example:
      
      [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
      [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
      [19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
      [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
      [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
      
      This is likely caused by the change in the 'estab' parameter
      passed to tcp_parse_options() when invoked by the functions
      in net/ipv4/tcp_minisocks.c.
      
      But even if that is fixed, the ->conn_request() changes made in
      this patch series are fundamentally wrong.  They try to use the
      listening socket's 'dst' to probe the route settings.  The
      listening socket doesn't even have a route, and you can't
      get the right route (the child request's one) until much later,
      after we set up all of the state, and it must be done by hand.
      
      This stuff really isn't ready, so the best thing to do is a
      full revert.  This reverts the following commits:
      
      f55017a9
      022c3f7d
      1aba721e
      cda42ebd
      345cda2f
      dc343475
      05eaade2
      6a2a2d6b
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 05 Nov 2009, 1 commit
  18. 04 Nov 2009, 1 commit
  19. 29 Oct 2009, 1 commit
  20. 21 Oct 2009, 1 commit
  21. 02 Sep 2009, 1 commit
    • netns: embed ip6_dst_ops directly · 86393e52
      Committed by Alexey Dobriyan
      struct net::ipv6.ip6_dst_ops is separately dynamically allocated,
      but there is no fundamental reason for it.  Embed it directly into
      struct netns_ipv6.
      
      For that:
      * move struct dst_ops into a separate header to fix circular
        dependencies (I honestly tried not to; it's pretty much impossible
        to do any other way)
      * drop the dynamic allocation; allocate together with the netns
      
      For a change, remove struct dst_ops::dst_net; it's deducible using
      container_of() given a dst_ops pointer.
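
      A sketch of that container_of() derivation (helper name hypothetical):

          static inline struct net *ip6_dst_ops_net(struct dst_ops *ops)
          {
                  /* ops is embedded in struct netns_ipv6, which is
                   * itself embedded in struct net */
                  return container_of(ops, struct net, ipv6.ip6_dst_ops);
          }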
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 03 Jun 2009, 1 commit
  23. 26 Nov 2008, 1 commit
  24. 17 Nov 2008, 1 commit
    • net: make sure struct dst_entry refcount is aligned on 64 bytes · 5635c10d
      Committed by Eric Dumazet
      As found in the past (commit f1dd9c37,
      "[NET]: Fix tbench regression in 2.6.25-rc1"), it is really
      important that the struct dst_entry refcount is aligned on a cache
      line.
      
      We cannot use __attribute__((aligned)), so manually pad the structure
      for 32- and 64-bit arches.
      
      for 32bit : offsetof(struct dst_entry, __refcnt) is 0x80
      for 64bit : offsetof(struct dst_entry, __refcnt) is 0xc0
      
      As it is not possible to guess the cache line size at compile time,
      we use a generic value of 64 bytes, which satisfies many current
      arches.  (Using 128-byte alignment on 64-bit arches would waste 64
      bytes.)
      
      Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" that
      would break this alignment.
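
      A sketch of such a compile-time check (placement and exact expression
      illustrative):

          /* e.g. in dst_alloc(): trip the build if __refcnt ever moves
           * off its 64-byte boundary */
          BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);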
      
      "tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
      (2350 MB/s instead of 2250 MB/s)
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 12 Nov 2008, 1 commit
  26. 29 Oct 2008, 1 commit
  27. 05 Aug 2008, 1 commit
  28. 19 Jul 2008, 1 commit
    • tcp: RTT metrics scaling · c1e20f7c
      Committed by Stephen Hemminger
      Some of the metrics (RTT, RTTVAR and RTAX_RTO_MIN) are stored in
      kernel units (jiffies), and this leaks out through the netlink API to
      user space, where the length of a jiffy is unknown.
      
      This patch changes the kernel to convert to/from milliseconds.  This
      changes the ABI, but milliseconds seemed like the most natural unit
      for these parameters.  Values available via syscall in
      /proc/net/rt_cache and via netlink will be in milliseconds.
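
      A sketch of the conversion at the user-visible boundary (dump
      direction; illustrative):

          /* RTT-class metrics are stored in jiffies internally but
           * exported in milliseconds */
          if (metric == RTAX_RTT || metric == RTAX_RTTVAR ||
              metric == RTAX_RTO_MIN)
                  val = jiffies_to_msecs(dst_metric(dst, metric));
          else
                  val = dst_metric(dst, metric);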
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 28 Mar 2008, 1 commit
  30. 13 Mar 2008, 1 commit
    • [NET]: Fix tbench regression in 2.6.25-rc1 · f1dd9c37
      Committed by Zhang Yanmin
      Compared with kernel 2.6.24, the tbench result shows a regression
      with 2.6.25-rc1.
      
      1) On the 2 quad-core processor Stoakley: 4%.
      2) On the 4 quad-core processor Tigerton: more than 30%.
      
      A bisect located the patch below.
      
      b4ce9277 is the first bad commit
      commit b4ce9277
      Author: Herbert Xu <herbert@gondor.apana.org.au>
      Date:   Tue Nov 13 21:33:32 2007 -0800
      
          [IPV6]: Move nfheader_len into rt6_info
      
          The dst member nfheader_len is only used by IPv6.  It's also currently
          creating a rather ugly alignment hole in struct dst.  Therefore this patch
          moves it from there into struct rt6_info.
      
      The above patch changes the cache line alignment, especially of the
      member __refcnt.  I did a test, adding 2 unsigned long paddings before
      lastuse, so the 3 members lastuse/__refcnt/__use moved to the next
      cache line; the performance was recovered.
      
      I created a patch to rearrange the members in struct dst_entry
      (sketched after the list below).

      With Eric's and Valdis Kletnieks' suggestions, I made a finer
      arrangement.
      
      1) Move tclassid under ops in the CONFIG_NET_CLS_ROUTE=y case, so
         sizeof(dst_entry)=200 whether CONFIG_NET_CLS_ROUTE=y or n.  I
         tested many patches on my 16-core Tigerton, moving tclassid to
         different places.  It looks like tclassid can also have an impact
         on performance: if tclassid is moved before metrics, or not moved
         at all, the performance isn't good, so I moved it behind metrics.
      
      2) Add comments before __refcnt.
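
      A sketch of the resulting layout (abridged; the comment paraphrases
      the intent rather than quoting the patch):

          struct dst_entry {
                  /* ... read-mostly fields: ops, metrics, tclassid ... */

                  /*
                   * __refcnt wants to be on a different cache line from
                   * the read-mostly fields above
                   */
                  atomic_t        __refcnt;       /* client references */
                  int             __use;
                  unsigned long   lastuse;
                  /* ... */
          };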
      
      On the 16-core Tigerton:
      
      If CONFIG_NET_CLS_ROUTE=y, the result with the patch below is about
      18% better than without it;

      if CONFIG_NET_CLS_ROUTE=n, the result with the patch below is about
      30% better than without it.

      With 32-bit 2.6.25-rc1 on the 8-core Stoakley, the new patch doesn't
      introduce a regression.
      
      Thanks to Eric, Valdis, and David!
      Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
      Acked-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 29 Jan 2008, 8 commits