1. 25 7月, 2015 1 次提交
  2. 29 6月, 2015 1 次提交
    • A
      ipv4: fix RCU lockdep warning from linkdown changes · 96ac5cc9
      Andy Gospodarek 提交于
      The following lockdep splat was seen due to the wrong context for
      grabbing in_dev.
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 Not tainted
      -------------------------------
      include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      2 locks held by ip/403:
       #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81453305>] rtnl_lock+0x17/0x19
       #1:  ((inetaddr_chain).rwsem){.+.+.+}, at: [<ffffffff8105c327>] __blocking_notifier_call_chain+0x35/0x6a
      
      stack backtrace:
      CPU: 2 PID: 403 Comm: ip Not tainted 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244
       0000000000000001 ffff8800b189b728 ffffffff8150a542 ffffffff8107a8b3
       ffff880037bbea40 ffff8800b189b758 ffffffff8107cb74 ffff8800379dbd00
       ffff8800bec85800 ffff8800bf9e13c0 00000000000000ff ffff8800b189b7d8
      Call Trace:
       [<ffffffff8150a542>] dump_stack+0x4c/0x6e
       [<ffffffff8107a8b3>] ? up+0x39/0x3e
       [<ffffffff8107cb74>] lockdep_rcu_suspicious+0xf7/0x100
       [<ffffffff814b63c3>] fib_dump_info+0x227/0x3e2
       [<ffffffff814b6624>] rtmsg_fib+0xa6/0x116
       [<ffffffff814b978f>] fib_table_insert+0x316/0x355
       [<ffffffff814b362e>] fib_magic+0xb7/0xc7
       [<ffffffff814b4803>] fib_add_ifaddr+0xb1/0x13b
       [<ffffffff814b4d09>] fib_inetaddr_event+0x36/0x90
       [<ffffffff8105c086>] notifier_call_chain+0x4c/0x71
       [<ffffffff8105c340>] __blocking_notifier_call_chain+0x4e/0x6a
       [<ffffffff8105c370>] blocking_notifier_call_chain+0x14/0x16
       [<ffffffff814a7f50>] __inet_insert_ifa+0x1a5/0x1b3
       [<ffffffff814a894d>] inet_rtm_newaddr+0x350/0x35f
       [<ffffffff81457866>] rtnetlink_rcv_msg+0x17b/0x18a
       [<ffffffff8107e7c3>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff8146965f>] ? netlink_deliver_tap+0x1cb/0x1f7
       [<ffffffff814576eb>] ? rtnl_newlink+0x72a/0x72a
      ...
      
      This patch resolves that splat.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96ac5cc9
  3. 24 6月, 2015 2 次提交
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek 提交于
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eeb075f
    • A
      net: track link-status of ipv4 nexthops · 8a3d0316
      Andy Gospodarek 提交于
      Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
      reachable via an interface where carrier is off.  No action is taken,
      but additional flags are passed to userspace to indicate carrier status.
      
      This also includes a cleanup to fib_disable_ip to more clearly indicate
      what event made the function call to replace the more cryptic force
      option previously used.
      
      v2: Split out kernel functionality into 2 patches, this patch simply
      sets and clears new nexthop flag RTNH_F_LINKDOWN.
      
      v3: Cleanups suggested by Alex as well as a bug noticed in
      fib_sync_down_dev and fib_sync_up when multipath was not enabled.
      
      v5: Whitespace and variable declaration fixups suggested by Dave.
      
      v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
      all.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a3d0316
  4. 03 5月, 2015 1 次提交
  5. 04 4月, 2015 1 次提交
  6. 01 4月, 2015 2 次提交
  7. 13 3月, 2015 1 次提交
  8. 28 2月, 2015 1 次提交
  9. 26 1月, 2015 1 次提交
  10. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  11. 06 1月, 2015 1 次提交
    • D
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann 提交于
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea697639
  12. 15 10月, 2014 1 次提交
  13. 06 9月, 2014 1 次提交
  14. 08 5月, 2014 1 次提交
  15. 17 4月, 2014 1 次提交
    • C
      ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif · 6a662719
      Cong Wang 提交于
      As suggested by Julian:
      
      	Simply, flowi4_iif must not contain 0, it does not
      	look logical to ignore all ip rules with specified iif.
      
      because in fib_rule_match() we do:
      
              if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
                      goto out;
      
      flowi4_iif should be LOOPBACK_IFINDEX by default.
      
      We need to move LOOPBACK_IFINDEX to include/net/flow.h:
      
      1) It is mostly used by flowi_iif
      
      2) Fix the following compile error if we use it in flow.h
      by the patches latter:
      
      In file included from include/linux/netfilter.h:277:0,
                       from include/net/netns/netfilter.h:5,
                       from include/net/net_namespace.h:21,
                       from include/linux/netdevice.h:43,
                       from include/linux/icmpv6.h:12,
                       from include/linux/ipv6.h:61,
                       from include/net/ipv6.h:16,
                       from include/linux/sunrpc/clnt.h:27,
                       from include/linux/nfs_fs.h:30,
                       from init/do_mounts.c:32:
      include/net/flow.h: In function ‘flowi4_init_output’:
      include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a662719
  16. 29 12月, 2013 1 次提交
  17. 19 10月, 2013 1 次提交
  18. 29 6月, 2013 1 次提交
    • T
      ipv4: use next hop exceptions also for input routes · 2ffae99d
      Timo Teräs 提交于
      Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops)
      assmued that "locally destined, and routed packets, never trigger
      PMTU events or redirects that will be processed by us".
      
      However, it seems that tunnel devices do trigger PMTU events in certain
      cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's
      skb_dst(skb)->ops->update_pmtu to propage mtu information from the
      outer flows. These can cause the inner flow mtu to be decreased. If
      next hop exceptions are not consulted for pmtu, IP fragmentation will
      not be done properly for these routes.
      
      It also seems that we really need to have the PMTU information always
      for netfilter TCPMSS clamp-to-pmtu feature to work properly.
      
      So for the time being, cache separate copies of input routes for
      each next hop exception.
      Signed-off-by: NTimo Teräs <timo.teras@iki.fi>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ffae99d
  19. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  20. 23 10月, 2012 1 次提交
  21. 09 10月, 2012 1 次提交
  22. 05 10月, 2012 1 次提交
    • E
      ipv4: add a fib_type to fib_info · f4ef85bb
      Eric Dumazet 提交于
      commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops.)
      introduced a regression for forwarding.
      
      This was hard to reproduce but the symptom was that packets were
      delivered to local host instead of being forwarded.
      
      David suggested to add fib_type to fib_info so that we dont
      inadvertently share same fib_info for different purposes.
      
      With help from Julian Anastasov who provided very helpful
      hints, reproduced here :
      
      <quote>
              Can it be a problem related to fib_info reuse
      from different routes. For example, when local IP address
      is created for subnet we have:
      
      broadcast 192.168.0.255 dev DEV  proto kernel  scope link  src
      192.168.0.1
      192.168.0.0/24 dev DEV  proto kernel  scope link  src 192.168.0.1
      local 192.168.0.1 dev DEV  proto kernel  scope host  src 192.168.0.1
      
              The "dev DEV  proto kernel  scope link  src 192.168.0.1" is
      a reused fib_info structure where we put cached routes.
      The result can be same fib_info for 192.168.0.255 and
      192.168.0.0/24. RTN_BROADCAST is cached only for input
      routes. Incoming broadcast to 192.168.0.255 can be cached
      and can cause problems for traffic forwarded to 192.168.0.0/24.
      So, this patch should solve the problem because it
      separates the broadcast from unicast traffic.
      
              And the ip_route_input_slow caching will work for
      local and broadcast input routes (above routes 1 and 3) just
      because they differ in scope and use different fib_info.
      
      </quote>
      
      Many thanks to Chris Clayton for his patience and help.
      Reported-by: NChris Clayton <chris2553@googlemail.com>
      Bisected-by: NChris Clayton <chris2553@googlemail.com>
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4ef85bb
  23. 11 9月, 2012 1 次提交
  24. 01 8月, 2012 3 次提交
    • D
      c5038a83
    • E
      ipv4: percpu nh_rth_output cache · d26b3a7c
      Eric Dumazet 提交于
      Input path is mostly run under RCU and doesnt touch dst refcnt
      
      But output path on forwarding or UDP workloads hits
      badly dst refcount, and we have lot of false sharing, for example
      in ipv4_mtu() when reading rt->rt_pmtu
      
      Using a percpu cache for nh_rth_output gives a nice performance
      increase at a small cost.
      
      24 udpflood test on my 24 cpu machine (dummy0 output device)
      (each process sends 1.000.000 udp frames, 24 processes are started)
      
      before : 5.24 s
      after : 2.06 s
      For reference, time on linux-3.5 : 6.60 s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d26b3a7c
    • E
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet 提交于
      commit 404e0a8b (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54764bb6
  25. 31 7月, 2012 1 次提交
    • E
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet 提交于
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      crashes happen on tcp workloads if routes are added/deleted at the same
      time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we dont increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      
      rt_cache_route() must take a reference on the new cached route, and
      release it if was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      404e0a8b
  26. 27 7月, 2012 1 次提交
  27. 21 7月, 2012 2 次提交
    • D
      ipv4: Cache input routes in fib_info nexthops. · d2d68ba9
      David S. Miller 提交于
      Caching input routes is slightly simpler than output routes, since we
      don't need to be concerned with nexthop exceptions.  (locally
      destined, and routed packets, never trigger PMTU events or redirects
      that will be processed by us).
      
      However, we have to elide caching for the DIRECTSRC and non-zero itag
      cases.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2d68ba9
    • D
      ipv4: Cache output routes in fib_info nexthops. · f2bb4bed
      David S. Miller 提交于
      If we have an output route that lacks nexthop exceptions, we can cache
      it in the FIB info nexthop.
      
      Such routes will have DST_HOST cleared because such routes refer to a
      family of destinations, rather than just one.
      
      The sequence of the handling of exceptions during route lookup is
      adjusted to make the logic work properly.
      
      Before we allocate the route, we lookup the exception.
      
      Then we know if we will cache this route or not, and therefore whether
      DST_HOST should be set on the allocated route.
      
      Then we use DST_HOST to key off whether we should store the resulting
      route, during rt_set_nexthop(), in the FIB nexthop cache.
      
      With help from Eric Dumazet.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2bb4bed
  28. 18 7月, 2012 1 次提交
  29. 17 7月, 2012 1 次提交
    • D
      ipv4: Add FIB nexthop exceptions. · 4895c771
      David S. Miller 提交于
      In a regime where we have subnetted route entries, we need a way to
      store persistent storage about destination specific learned values
      such as redirects and PMTU values.
      
      This is implemented here via nexthop exceptions.
      
      The initial implementation is a 2048 entry hash table with relaiming
      starting at chain length 5.  A more sophisticated scheme can be
      devised if that proves necessary.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4895c771
  30. 11 7月, 2012 1 次提交
  31. 06 7月, 2012 1 次提交
  32. 29 6月, 2012 1 次提交
    • D
      ipv4: Elide fib_validate_source() completely when possible. · 7a9bc9b8
      David S. Miller 提交于
      If rpfilter is off (or the SKB has an IPSEC path) and there are not
      tclassid users, we don't have to do anything at all when
      fib_validate_source() is invoked besides setting the itag to zero.
      
      We monitor tclassid uses with a counter (modified only under RTNL and
      marked __read_mostly) and we protect the fib_validate_source() real
      work with a test against this counter and whether rpfilter is to be
      done.
      
      Having a way to know whether we need no tclassid processing or not
      also opens the door for future optimized rpfilter algorithms that do
      not perform full FIB lookups.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a9bc9b8
  33. 18 6月, 2012 1 次提交
  34. 24 5月, 2012 1 次提交
    • Y
      ipv4: fix the rcu race between free_fib_info and ip_route_output_slow · e49cc0da
      Yanmin Zhang 提交于
      We hit a kernel OOPS.
      
      <3>[23898.789643] BUG: sleeping function called from invalid context at
      /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
      <3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
      Thread-6683
      <4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
      3.0.8-137685-ge7742f9 #1
      <4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.904225] Call Trace:
      <4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
      <4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
      <4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
      <4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
      <4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
      <4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
      <4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
      <4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
      <4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
      <4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
      <4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
      <4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
      <4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
      <4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
      <4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50
      
      Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
      fi. Other cpu might be accessing fi. Fixing it by delaying the releasing.
      
      With the patch, we ran MTBF testing on Android mobile for 12 hours
      and didn't trigger the issue.
      
      Thank Eric for very detailed review/checking the issue.
      Signed-off-by: NYanmin Zhang <yanmin_zhang@linux.intel.com>
      Signed-off-by: NKun Jiang <kunx.jiang@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e49cc0da
  35. 02 4月, 2012 1 次提交