1. 27 5月, 2017 1 次提交
    • E
      ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet 提交于
      Andrey Konovalov reported crashes in ipv4_mtu()
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time run following loop :
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested to add a refcount to the metrics themselves,
      being a standalone object (in particular, no reference to other objects)
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      need to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fb07daf
  2. 12 2月, 2017 1 次提交
  3. 08 2月, 2017 2 次提交
  4. 26 4月, 2016 1 次提交
    • J
      route: move lwtunnel state to a single place · 0868e253
      Jiri Benc 提交于
      Commit 751a587a ("route: fix breakage after moving lwtunnel state")
      moved lwtstate to the end of dst_entry for 32bit archs. This makes it share
      the cacheline with __refcnt which had an unkown effect on performance. For
      this reason, the pointer was kept in place for 64bit archs.
      
      However, later performance measurements showed this is of no concern. It
      turns out that every performance sensitive path that accesses lwtstate
      accesses also struct rtable or struct rt6_info which share the same cache
      line.
      
      Thus, to get rid of a few #ifdefs, move the field to the end of the struct
      also for 64bit.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0868e253
  5. 19 3月, 2016 1 次提交
  6. 15 12月, 2015 1 次提交
    • E
      net: fix IP early demux races · 5037e9ef
      Eric Dumazet 提交于
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst to skb, after a lookup in TCP/UDP
      sockets.
      
      When socket cache is not properly set, we want to store into
      sk->sk_dst_cache the dst for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      
      Problem is this acquisition is simply using an atomic_inc(),
      which works well, unless the dst was queued for destruction from
      dst_release() noticing dst refcount went to zero, if DST_NOCACHE
      was set on dst.
      
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      
      This patch, being a stable candidate, adds two new helpers, and use
      them only from IP early demux problematic paths.
      
      It might be possible to merge in net-next skb_dst_force() and
      skb_dst_force_safe(), but I prefer having the smallest patch for stable
      kernels : Maybe some skb_dst_force() callers do not expect skb->dst
      can suddenly be cleared.
      
      Can probably be backported back to linux-3.6 kernels
      Reported-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Tested-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5037e9ef
  7. 08 10月, 2015 2 次提交
  8. 26 9月, 2015 1 次提交
  9. 18 9月, 2015 2 次提交
  10. 01 9月, 2015 1 次提交
    • D
      tcp: use dctcp if enabled on the route to the initiator · c3a8d947
      Daniel Borkmann 提交于
      Currently, the following case doesn't use DCTCP, even if it should:
      A responder has f.e. Cubic as system wide default, but for a specific
      route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
      initiating host then uses DCTCP as congestion control, but since the
      initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
      and we have to fall back to Reno after 3WHS completes.
      
      We were thinking on how to solve this in a minimal, non-intrusive
      way without bloating tcp_ecn_create_request() needlessly: lets cache
      the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
      is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
      contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
      to only do a single metric feature lookup inside tcp_ecn_create_request().
      
      Joint work with Florian Westphal.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3a8d947
  11. 28 8月, 2015 1 次提交
  12. 24 8月, 2015 1 次提交
  13. 21 8月, 2015 1 次提交
  14. 22 7月, 2015 1 次提交
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf 提交于
      Introduces a new dst_metadata which enables to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38a9eb1
  15. 13 5月, 2015 1 次提交
  16. 02 5月, 2015 1 次提交
  17. 12 2月, 2015 1 次提交
  18. 16 9月, 2014 2 次提交
  19. 16 4月, 2014 1 次提交
  20. 28 3月, 2014 1 次提交
    • M
      ipv6: do not overwrite inetpeer metrics prematurely · e5fd387a
      Michal Kubeček 提交于
      If an IPv6 host route with metrics exists, an attempt to add a
      new route for the same target with different metrics fails but
      rewrites the metrics anyway:
      
      12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
      12sp0:~ # ip -6 route show
      fe80::/64 dev eth0  proto kernel  metric 256
      fec0::1 dev eth0  metric 1024  rto_min lock 1s
      12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500
      RTNETLINK answers: File exists
      12sp0:~ # ip -6 route show
      fe80::/64 dev eth0  proto kernel  metric 256
      fec0::1 dev eth0  metric 1024  rto_min lock 1.5s
      
      This is caused by all IPv6 host routes using the metrics in
      their inetpeer (or the shared default). This also holds for the
      new route created in ip6_route_add() which shares the metrics
      with the already existing route and thus ip6_route_add()
      rewrites the metrics even if the new route ends up not being
      used at all.
      
      Another problem is that old metrics in inetpeer can reappear
      unexpectedly for a new route, e.g.
      
      12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
      12sp0:~ # ip route del fec0::1
      12sp0:~ # ip route add fec0::1 dev eth0
      12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10
      12sp0:~ # ip -6 route show
      fe80::/64 dev eth0  proto kernel  metric 256
      fec0::1 dev eth0  metric 1024  hoplimit 10 rto_min lock 1s
      
      Resolve the first problem by moving the setting of metrics down
      into fib6_add_rt2node() to the point we are sure we are
      inserting the new route into the tree. Second problem is
      addressed by introducing new flag DST_METRICS_FORCE_OVERWRITE
      which is set for a new host route in ip6_route_add() and makes
      ipv6_cow_metrics() always overwrite the metrics in inetpeer
      (even if they are not "new"); it is reset after that.
      
      v5: use a flag in _metrics member rather than one in flags
      
      v4: fix a typo making a condition always true (thanks to Hannes
      Frederic Sowa)
      
      v3: rewritten based on David Miller's idea to move setting the
      metrics (and allocation in non-host case) down to the point we
      already know the route is to be inserted. Also rebased to
      net-next as it is quite late in the cycle.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5fd387a
  21. 07 3月, 2014 1 次提交
  22. 18 12月, 2013 1 次提交
    • T
      net: Add utility functions to clear rxhash · 7539fadc
      Tom Herbert 提交于
      In several places 'skb->rxhash = 0' is being done to clear the
      rxhash value in an skb.  This does not clear l4_rxhash which could
      still be set so that the rxhash wouldn't be recalculated on subsequent
      call to skb_get_rxhash.  This patch adds an explict function to clear
      all the rxhash related information in the skb properly.
      
      skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
      l4_rxhash.
      
      Fixed up places where 'skb->rxhash = 0' was being called.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7539fadc
  23. 18 10月, 2013 1 次提交
  24. 21 9月, 2013 1 次提交
  25. 04 9月, 2013 1 次提交
  26. 15 3月, 2013 1 次提交
  27. 21 2月, 2013 1 次提交
  28. 06 2月, 2013 1 次提交
    • S
      xfrm: Add a state resolution packet queue · a0073fe1
      Steffen Klassert 提交于
      As the default, we blackhole packets until the key manager resolves
      the states. This patch implements a packet queue where IPsec packets
      are queued until the states are resolved. We generate a dummy xfrm
      bundle, the output routine of the returned route enqueues the packet
      to a per policy queue and arms a timer that checks for state resolution
      when dst_output() is called. Once the states are resolved, the packets
      are sent out of the queue. If the states are not resolved after some
      time, the queue is flushed.
      
      This patch keeps the defaut behaviour to blackhole packets as long
      as we have no states. To enable the packet queue the sysctl
      xfrm_larval_drop must be switched off.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      a0073fe1
  29. 09 8月, 2012 1 次提交
    • E
      net: force dst_default_metrics to const section · a37e6e34
      Eric Dumazet 提交于
      While investigating on network performance problems, I found this little
      gem :
      
      $ nm -v vmlinux | grep -1 dst_default_metrics
      ffffffff82736540 b busy.46605
      ffffffff82736560 B dst_default_metrics
      ffffffff82736598 b dst_busy_list
      
      Apparently, declaring a const array without initializer put it in
      (writeable) bss section, in middle of possibly often dirtied cache
      lines.
      
      Since we really want dst_default_metrics be const to avoid any possible
      false sharing and catch any buggy writes, I force a null initializer.
      
      ffffffff818a4c20 R dst_default_metrics
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a37e6e34
  30. 08 8月, 2012 1 次提交
    • E
      net: output path optimizations · 425f09ab
      Eric Dumazet 提交于
      1) Avoid dirtying neighbour's confirmed field.
      
        TCP workloads hits this cache line for each incoming ACK.
        Lets write n->confirmed only if there is a jiffie change.
      
      2) Optimize neigh_hh_output() for the common Ethernet case, were
         hh_len is less than 16 bytes. Replace the memcpy() call
         by two inlined 64bit load/stores on x86_64.
      
      Bench results using udpflood test, with -C option (MSG_CONFIRM flag
      added to sendto(), to reproduce the n->confirmed dirtying on UDP)
      
      24 threads doing 1.000.000 UDP sendto() on dummy device, 4 runs.
      
      before : 2.247s, 2.235s, 2.247s, 2.318s
      after  : 1.884s, 1.905s, 1.891s, 1.895s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      425f09ab
  31. 01 8月, 2012 1 次提交
    • E
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet 提交于
      commit 404e0a8b (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54764bb6
  32. 31 7月, 2012 1 次提交
    • E
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet 提交于
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      crashes happen on tcp workloads if routes are added/deleted at the same
      time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we dont increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      
      rt_cache_route() must take a reference on the new cached route, and
      release it if was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      404e0a8b
  33. 21 7月, 2012 2 次提交
  34. 11 7月, 2012 1 次提交
  35. 05 7月, 2012 1 次提交