1. 06 Jan 2015, 10 commits
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann committed
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoid exposing this implementation detail to user space. Instead,
      the netlink attribute exchanged with user space carries the actual
      congestion control algorithm name, similar to the setsockopt(2) API, in
      order to allow for maximum flexibility, even for third-party modules
      (see the comparison sketch after this entry).
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot, as
      it should have been stored in RTAX_FEATURES instead. We first thought
      about reusing that slot for the congestion control key, but it brings
      more complications and/or confusion than it is worth.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea697639
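
      For comparison, a minimal user-space sketch of the setsockopt(2)
      interface that the new netlink attribute mirrors; the algorithm is
      selected by name here as well (error handling omitted):

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <string.h>
        #include <sys/socket.h>

        /* Select a TCP congestion control algorithm by name. */
        static int set_cc(int fd, const char *name)
        {
                return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                                  name, strlen(name));
        }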
    • net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann committed
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      For per route congestion control support, our aim is to store a
      unique u32 key identifier in the dst metrics, which can then be
      mapped to a tcp_congestion_ops struct. We argue that an RTAX key
      entry is the simplest, most generic and easiest way to manage, and
      it also keeps the memory footprint of dst entries lower on 64 bit
      than storing a pointer directly, for example. Having a unique
      key id also allows decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which takes
      care of dynamic assignment of unused key space and also performs the
      key-to-pointer mapping under RCU. While doing so, we stumbled upon the
      issue that, due to the nature of dynamic key distribution, excessive
      module loads and unloads can, on arguably very rare occasions, lead to
      reuse of previously used key space. Stale keys left in the dst metric
      would then be reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk the FIBs on the genuinely rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that is just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded,
      worst case perhaps 15, so the possibility of a key collision is
      extremely low, and the mapping is guaranteed collision-free on LE/BE
      for all in-tree modules. Overall this results in much simpler code, and
      all without the overhead of an IDR. Due to the deterministic nature,
      modules can now be unloaded; the congestion control algorithm for a
      specific but unloaded key falls back to the default one, and on module
      reload it transparently switches back to the expected algorithm (see
      the sketch after this entry).
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5c6a8ab
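
      A minimal sketch of the name-to-key mapping described above; the helper
      name is hypothetical, but jhash() is the kernel's standard hash from
      <linux/jhash.h>:

        #include <linux/jhash.h>
        #include <linux/string.h>

        /* Hypothetical helper: deterministically map an algorithm name
         * into u32 key space, so the same name always yields the same
         * dst-metric key across module unload/reload. */
        static u32 cc_name_to_key(const char *name)
        {
                return jhash(name, strlen(name), 0);
        }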
    • net: tcp: refactor reinitialization of congestion control · 29ba4fff
      Daniel Borkmann committed
      We can just move this to an extra function and make the code
      a bit more readable, no functional change.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29ba4fff
    • net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal committed
      Do the nla validation earlier, outside the write lock.
      
      This is needed by a follow-up patch which must be able to call
      request_module() (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e715b6d3
    • net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5
      Daniel Borkmann committed
      When IPv6 host routes with metrics attached are being added, we fetch
      the metrics store from the dst via COW through dst_metrics_write_ptr(),
      added through commit e5fd387a.
      
      One remaining problem here is that we actually call into inet_getpeer()
      and may end up allocating/creating a new peer from the kmemcache, which
      may fail.
      
      Example trace from perf probe (inet_getpeer:41) where create is 1:
      
      ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
        85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
        85e578 inet_getpeer
        8eb333 ipv6_cow_metrics
        8f10ff fib6_commit_metrics
      
      Therefore, a check for NULL on the return of dst_metrics_write_ptr()
      is necessary here (see the sketch after this entry).
      
      Joint work with Florian Westphal.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0409c9a5
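
      A hedged sketch of the fix, assuming the surrounding
      fib6_commit_metrics() context; dst_metrics_write_ptr() is the COW
      accessor, and the error value is illustrative:

        u32 *mp = dst_metrics_write_ptr(dst);

        /* The COW path may call into inet_getpeer(), whose kmemcache
         * allocation can fail; bail out instead of dereferencing NULL. */
        if (unlikely(!mp))
                return -ENOMEM;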
    • net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined · 6cb69742
      Hubert Sokolowski committed
      Add a check for whether the call to ndo_dflt_fdb_dump is needed at all.
      Some drivers (e.g. qlcnic or macvlan) define their own ndo_fdb_dump and
      do not expect ndo_dflt_fdb_dump to be called unconditionally; others do
      not want ndo_dflt_fdb_dump to be called at all. At the same time, it is
      desirable to call the default dump function on a bridge device. Fix the
      attributes that are passed to dev->netdev_ops->ndo_fdb_dump, and add
      extra checking in br_fdb_dump to avoid duplicate entries now that
      filter_dev can be NULL. A sketch of the resulting dispatch follows the
      test transcript below.
      
      The following filtering tests were performed before and after the patch
      was applied to make sure the results are the same and the filtering
      algorithm is not broken.
      
      [root@localhost ~]# cd /root/iproute2-3.18.0/bridge
      [root@localhost bridge]# modprobe dummy
      [root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
      [root@localhost bridge]# brctl addbr br0
      [root@localhost bridge]# brctl addif  br0 dummy0
      [root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
      [root@localhost bridge]# # show all
      [root@localhost bridge]# ./bridge fdb show
      33:33:00:00:00:01 dev p2p1 self permanent
      01:00:5e:00:00:01 dev p2p1 self permanent
      33:33:ff:ac:ce:32 dev p2p1 self permanent
      33:33:00:00:02:02 dev p2p1 self permanent
      01:00:5e:00:00:fb dev p2p1 self permanent
      33:33:00:00:00:01 dev p7p1 self permanent
      01:00:5e:00:00:01 dev p7p1 self permanent
      33:33:ff:79:50:53 dev p7p1 self permanent
      33:33:00:00:02:02 dev p7p1 self permanent
      01:00:5e:00:00:fb dev p7p1 self permanent
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by bridge
      [root@localhost bridge]# ./bridge fdb show br br0
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by port
      [root@localhost bridge]# ./bridge fdb show brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]# # filter by port + bridge
      [root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]#
      Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cb69742
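
      A hedged sketch of the core dispatch in rtnl_fdb_dump() after this
      change; the exact signatures are from memory and should be treated as
      illustrative:

        if (dev->netdev_ops->ndo_fdb_dump)
                idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, dev,
                                                    NULL, idx);
        else
                idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);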
    • ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert committed
      Add an ip_cmsg_recv_offset function which takes an offset argument
      indicating the starting offset in the skb of the data being received.
      This will be useful in the case of UDP, for providing the checksum to
      user space.
      
      ip_cmsg_recv is now an inline call to ip_cmsg_recv_offset with an offset
      of zero, as sketched after this entry.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad6f939a
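
      A minimal sketch of the wrapper the message describes (the prototype
      may differ slightly from the actual patch):

        static inline void ip_cmsg_recv(struct msghdr *msg,
                                        struct sk_buff *skb)
        {
                ip_cmsg_recv_offset(msg, skb, 0);
        }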
    • ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert committed
      Add an ip_cmsg_recv_offset function which takes an offset argument
      indicating the starting offset in the skb of the data being received.
      This will be useful in the case of UDP, for providing the checksum to
      user space.
      
      ip_cmsg_recv is now an inline call to ip_cmsg_recv_offset with an offset
      of zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5961de9f
    • ip: IP cmsg cleanup · c44d13d6
      Tom Herbert committed
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv so that it does not walk through the flags
      using a shift, but instead tests each flag with a bitwise 'and'. This
      eliminates both the shift and one conditional per flag check (see the
      sketch after this entry).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c44d13d6
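
      A hedged sketch of the flag dispatch after the cleanup; the wrapper
      name is hypothetical and the helper names follow net/ipv4/ip_sockglue.c
      conventions but are quoted from memory:

        static void ip_cmsg_recv_sketch(struct msghdr *msg,
                                        struct sk_buff *skb,
                                        unsigned int flags)
        {
                /* Test each option bit directly instead of shifting
                 * through the mask one bit at a time. */
                if (flags & IP_CMSG_PKTINFO)
                        ip_cmsg_recv_pktinfo(msg, skb);
                if (flags & IP_CMSG_TTL)
                        ip_cmsg_recv_ttl(msg, skb);
                if (flags & IP_CMSG_TOS)
                        ip_cmsg_recv_tos(msg, skb);
        }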
    • ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert committed
      Move convert_csum from udp_sock to inet_sock. This opens up the
      possibility of using checksum conversion for other types of sockets,
      and also allows checksum conversion to be enabled from the inet layer
      (which we will want when enabling the IP_CHECKSUM cmsg); an accessor
      sketch follows this entry.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      224d019c
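
      Plausible accessors for the relocated field; the names are assumptions
      modeled on common kernel style, not necessarily the exact helpers added:

        /* Hypothetical accessors for the flag now kept in inet_sock. */
        static inline void inet_set_convert_csum(struct sock *sk, bool val)
        {
                inet_sk(sk)->convert_csum = val;
        }

        static inline bool inet_get_convert_csum(struct sock *sk)
        {
                return !!inet_sk(sk)->convert_csum;
        }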
  2. 05 Jan 2015, 4 commits
    • geneve: Check family when reusing sockets. · 46b1e4f9
      Jesse Gross committed
      When searching for an existing socket to reuse, the address family
      is not taken into account, only the port number. This means that an
      IPv4 socket could be used for IPv6 traffic and vice versa, which
      is sure to cause problems when passing packets (see the sketch after
      this entry).
      
      It is not possible to trigger this problem currently because the
      only user of Geneve creates just IPv4 sockets. However, that is
      likely to change in the near future.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46b1e4f9
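
      A hedged sketch of the corrected lookup; the structure and field names
      are assumptions based on the description:

        /* Match on both address family and port, not port alone. */
        static struct geneve_sock *geneve_find_sock(struct net *net,
                                                    sa_family_t family,
                                                    __be16 port)
        {
                struct geneve_net *gn = net_generic(net, geneve_net_id);
                struct geneve_sock *gs;

                list_for_each_entry(gs, &gn->sock_list, list) {
                        if (inet_sk(gs->sock->sk)->inet_sport == port &&
                            gs->sock->sk->sk_family == family)
                                return gs;
                }
                return NULL;
        }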
    • geneve: Remove socket hash table. · df5dba8e
      Jesse Gross committed
      The hash table for open Geneve ports is used only at creation and
      deletion time. It is not performance critical and is not likely to
      grow to a large number of items. Therefore, this can be changed
      to use a simple linked list.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df5dba8e
    • geneve: Simplify locking. · 829a3ada
      Jesse Gross committed
      The existing Geneve locking scheme was pulled over directly from
      VXLAN. However, VXLAN has a number of built in mechanisms which make
      the locking more complex and are unlikely to be necessary with Geneve.
      This simplifies the locking to a basic scheme of a mutex when doing
      updates plus RCU on receive (sketched after this entry).
      
      In addition to making the code easier to read, this also avoids the
      possibility of a race when creating or destroying sockets since
      UDP sockets and the list of Geneve sockets are protected by different
      locks. After this change, the entire operation is atomic.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829a3ada
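
      A minimal sketch of that scheme under hypothetical names: a mutex
      serializes control-path updates while the receive path walks the list
      under RCU only.

        #include <linux/list.h>
        #include <linux/mutex.h>
        #include <linux/rculist.h>

        struct geneve_sock {                    /* reduced for illustration */
                struct list_head list;
                __be16 port;
        };

        static LIST_HEAD(geneve_sock_list);     /* hypothetical names */
        static DEFINE_MUTEX(geneve_mutex);      /* serializes updates */

        /* Control path: additions (and removals) take the mutex. */
        static void geneve_add_sock(struct geneve_sock *gs)
        {
                mutex_lock(&geneve_mutex);
                list_add_rcu(&gs->list, &geneve_sock_list);
                mutex_unlock(&geneve_mutex);
        }

        /* Receive path: readers only need rcu_read_lock(). */
        static bool geneve_port_in_use(__be16 port)
        {
                struct geneve_sock *gs;
                bool found = false;

                rcu_read_lock();
                list_for_each_entry_rcu(gs, &geneve_sock_list, list) {
                        if (gs->port == port) {
                                found = true;
                                break;
                        }
                }
                rcu_read_unlock();
                return found;
        }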
    • geneve: Remove workqueue. · 61f3cade
      Jesse Gross committed
      The work queue is used only to free the UDP socket upon destruction.
      This is not necessary with Geneve and generally makes the code more
      difficult to reason about. It also introduces nondeterministic
      behavior, such as when a socket is rapidly deleted and recreated, which
      could fail as the deletion happens asynchronously.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61f3cade
  3. 04 Jan 2015, 5 commits
    • netlink: Lockless lookup with RCU grace period in socket release · 21e4902a
      Thomas Graf committed
      Defer the release of the socket reference using call_rcu() to allow
      an RCU read-side protected call to rhashtable_lookup() (see the sketch
      after this entry).
      
      This restores behaviour and performance gains as previously
      introduced by e341694e ("netlink: Convert netlink_lookup() to use
      RCU protected hash table") without the side effect of severely
      delayed socket destruction.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21e4902a
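
      A hedged sketch of the deferral, following the common kernel pattern of
      embedding an rcu_head in the object and dropping the reference in the
      callback; the field and function names are assumptions:

        static void deferred_put_nlk_sk(struct rcu_head *head)
        {
                struct netlink_sock *nlk =
                        container_of(head, struct netlink_sock, rcu);

                sock_put(&nlk->sk);
        }

        static void netlink_release_deferred(struct netlink_sock *nlk)
        {
                /* Defer the final put past a grace period so that
                 * concurrent lockless lookups remain safe. */
                call_rcu(&nlk->rcu, deferred_put_nlk_sk);
        }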
    • rhashtable: Per bucket locks & deferred expansion/shrinking · 97defe1e
      Thomas Graf committed
      Introduces an array of spinlocks to protect bucket mutations. The number
      of spinlocks per CPU is configurable, and the lock for a bucket is
      selected based on the bucket's hash. This allows parallel insertions
      and removals of entries which do not share a lock (see the sketch after
      this entry).
      
      The patch also defers expansion and shrinking to a worker queue which
      allows insertion and removal from atomic context. Insertions and
      deletions may occur in parallel to it and are only held up briefly
      while the particular bucket is linked or unzipped.
      
      Mutations of the bucket table pointer are protected by a new mutex;
      read access is RCU protected.
      
      In the event of an expansion or shrinking, the newly allocated bucket
      table is exposed as a so-called future table as soon as the resize
      process starts.  Lookups, deletions, and insertions will briefly use
      both tables.  The future table becomes the main table after an RCU
      grace period has passed and the initial linking of the old to the new
      table has been performed.  Optimization of the chains to make use of
      the new number of buckets follows only once the new table is in use.
      
      The side effect of this is that during that RCU grace period, a bucket
      traversal using any rht_for_each() variant on the main table will not see
      any insertions performed during the RCU grace period which would at that
      point land in the future table. The lookup will see them as it searches
      both tables if needed.
      
      Having multiple insertions and removals occur in parallel requires nelems
      to become an atomic counter.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97defe1e
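
      A hedged sketch of selecting a bucket's lock from its hash; with a
      power-of-two lock array, masking suffices, and many buckets share each
      lock (names are illustrative):

        /* Pick the spinlock guarding a given bucket. */
        static inline spinlock_t *bucket_lock(const struct bucket_table *tbl,
                                              u32 hash)
        {
                return &tbl->locks[hash & tbl->locks_mask];
        }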
    • nft_hash: Remove rhashtable_remove_pprev() · 897362e4
      Thomas Graf committed
      The removal function of nft_hash currently stores a reference to the
      previous element during lookup which is used to optimize removal later
      on. This was possible because a lock is held throughout calling
      rhashtable_lookup() and rhashtable_remove().
      
      With the introduction of deferred table resizing in parallel to lookups
      and insertions, the nftables lock will no longer synchronize all
      table mutations and the stored pprev may become invalid.
      
      Removing this optimization makes removal slightly more expensive on
      average but allows taking the resize cost out of the insert and
      remove path.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      897362e4
    • rhashtable: Convert bucket iterators to take table and index · 88d6ed15
      Thomas Graf committed
      This patch is in preparation to introduce per bucket spinlocks. It
      extends all iterator macros to take the bucket table and bucket
      index. It also introduces a new rht_dereference_bucket() to
      handle protected accesses to buckets.
      
      It introduces a barrier() to the RCU iterators to prevent the compiler
      from caching the first element.
      
      The lockdep verifier is introduced as a stub which always succeeds; it
      is properly implemented in the next patch, when the locks are
      introduced.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88d6ed15
    • rhashtable: Do hashing inside of rhashtable_lookup_compare() · 8d24c0b4
      Thomas Graf committed
      Hash the key inside of rhashtable_lookup_compare() like
      rhashtable_lookup() does. This allows us to simplify the hashing
      functions and keep them private (illustrative call after this entry).
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d24c0b4
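
      An illustrative call after the change: the caller passes the raw key
      and a compare callback, and hashing happens internally. The argument
      shapes are assumptions based on the description:

        /* Hypothetical compare callback: match on a u32 id field. */
        static bool my_cmp(void *ptr, void *arg)
        {
                return ((struct my_obj *)ptr)->id == *(u32 *)arg;
        }

        /* inside a lookup function: */
        obj = rhashtable_lookup_compare(ht, &key, my_cmp, &key);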
  4. 03 Jan 2015, 3 commits
  5. 01 Jan 2015, 18 commits
    • fib_trie: Add tracking value for suffix length · 5405afd1
      Alexander Duyck committed
      This change adds a tracking value for the maximum suffix length of all
      prefixes stored in any given tnode.  With this value we can determine
      whether we need to backtrace or not based on whether the suffix length
      is greater than the pos value (see the sketch after this entry).
      
      By doing this we can reduce the CPU overhead for lookups in the local
      table, as many of the prefixes there are 32b long and have a suffix
      length of 0, meaning we can immediately backtrace to the root node
      without needing to test any of the nodes between it and where we ended
      up.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5405afd1
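
      A hedged sketch of the resulting check in the lookup path; the field
      names (slen for the tracked suffix length) follow the description but
      are assumptions:

        /* Only record a backtrace point if some suffix below this
         * node extends past the bits checked so far. */
        if (n->slen > n->pos) {
                pn = n;         /* node to backtrace to */
                cindex = index;
        }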
    • fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child · 21d1f11d
      Alexander Duyck committed
      For some reason the compiler doesn't seem to understand that when we
      are in a loop that runs from tnode_child_length - 1 to 0, we don't
      expect the value of tn->bits to change.  As such, every call to
      tnode_get_child was rerunning tnode_child_length, which ended up
      consuming quite a bit of space in the resulting assembly code.
      
      I have gone through and verified that in all cases where tnode_get_child
      is used we are either winding through a fixed loop from
      tnode_child_length - 1 to 0, or are in a fast-path case where we verify
      the value by either checking for any remaining bits after shifting the
      index by bits and testing for a leaf, or by using tnode_child_length.
      
      size net/ipv4/fib_trie.o
      Before:
         text	   data	    bss	    dec	    hex	filename
        15506	    376	      8	  15890	   3e12	net/ipv4/fib_trie.o
      
      After:
         text	   data	    bss	    dec	    hex	filename
        14827	    376	      8	  15211	   3b6b	net/ipv4/fib_trie.o
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21d1f11d
    • fib_trie: inflate/halve nodes in a more RCU friendly way · 12c081a5
      Alexander Duyck committed
      This change pulls the node_set_parent functionality out of
      put_child_reorg and instead leaves it to the calling function to take
      care of.  By doing this we can fully construct the new cluster of
      tnodes and all of the pointers out of it before we start routing
      pointers into it.
      
      I suspect this will likely fix some concurrency issues, though I don't
      have a good test to demonstrate them.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12c081a5
    • fib_trie: Push tnode flushing down to inflate/halve · fc86a93b
      Alexander Duyck committed
      This change pushes the tnode freeing down into the inflate and halve
      functions.  It makes more sense here as we have a better grasp of what is
      going on and when a given cluster of nodes is ready to be freed.
      
      I believe this may address a bug in the freeing logic as well.  For
      some reason, if the freelist got to a certain size we would call
      synchronize_rcu().  I'm assuming that what was meant is to call
      synchronize_rcu() after that much memory had been handed off via
      call_rcu().  As such, that is what I have updated the behavior to be.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc86a93b
    • fib_trie: Push assignment of child to parent down into inflate/halve · ff181ed8
      Alexander Duyck committed
      This change makes it so that the assignment of the tnode to the parent
      is handled directly within whatever function is currently handling the
      node, be it inflate, halve, or resize.  By doing this we can avoid some
      of the need to set NULL pointers in the tree while we are resizing the
      subnodes.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff181ed8
    • fib_trie: Add functions should_inflate and should_halve · f05a4819
      Alexander Duyck committed
      This change pulls the logic for whether we should inflate or halve a
      node out into separate functions.  It also addresses what I believe is
      a bug where a single full node was all that was needed to keep a node
      from ever being halved.
      
      Simple script to reproduce the issue:
      	modprobe dummy;	ifconfig dummy0 up
      	for i in `seq 0 255`; do ifconfig dummy0:$i 10.0.${i}.1/24 up; done
      	ifconfig dummy0:256 10.0.255.33/16 up
      	for i in `seq 0 254`; do ifconfig dummy0:$i down; done
      
      Results from /proc/net/fib_triestat
      Before:
      	Local:
      		Aver depth:     3.00
      		Max depth:      4
      		Leaves:         17
      		Prefixes:       18
      		Internal nodes: 11
      		  1: 8  2: 2  10: 1
      		Pointers: 1048
      	Null ptrs: 1021
      	Total size: 11  kB
      After:
      	Local:
      		Aver depth:     3.41
      		Max depth:      5
      		Leaves:         17
      		Prefixes:       18
      		Internal nodes: 12
      		  1: 8  2: 3  3: 1
      		Pointers: 36
      	Null ptrs: 8
      	Total size: 3  kB
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f05a4819
    • fib_trie: Move resize to after inflate/halve · cf3637bb
      Alexander Duyck committed
      This change consists of a cut/paste of resize to behind inflate and halve
      so that I could remove the two function prototypes.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf3637bb
    • fib_trie: Push rcu_read_lock/unlock to callers · 345e9b54
      Alexander Duyck committed
      This change starts cleaning up some of the rcu_read_lock/unlock
      handling.  While reviewing the code I realized there are several spots
      that I don't believe are being handled correctly, or that mask warnings
      by locally calling rcu_read_lock/unlock instead of calling them at the
      correct level.
      
      A common example is a call to fib_get_table followed by
      fib_table_lookup.  The rcu_read_lock/unlock ought to wrap both, but
      there are several spots where they did not (see the sketch after this
      entry).
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      345e9b54
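
      A sketch of the corrected pattern: the read-side critical section
      covers both the table lookup and the route lookup.  fib_get_table()
      and fib_table_lookup() are the real FIB APIs; the surrounding code is
      illustrative:

        struct flowi4 fl4 = {};         /* flow key, filled in by caller */
        struct fib_result res;
        struct fib_table *tb;
        int err = -ENETUNREACH;

        rcu_read_lock();
        tb = fib_get_table(net, RT_TABLE_MAIN);
        if (tb)
                err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
        rcu_read_unlock();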
    • fib_trie: Use unsigned long for anything dealing with a shift by bits · 98293e8d
      Alexander Duyck committed
      This change makes it so that anything that can be shifted by, or
      compared to, a value shifted by bits is updated to be an unsigned long.
      This is mostly a precaution against an insanely huge address space that
      somehow starts coming close to the 2^32 root node size, which would
      require something like 1.5 billion addresses.
      
      I chose unsigned long instead of unsigned long long since I do not
      believe it is possible to allocate a 32 bit tnode on a 32 bit system,
      as the memory consumed would be 16GB + 28B, which exceeds the
      addressable space for any one process.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98293e8d
    • fib_trie: Update meaning of pos to represent unchecked bits · e9b44019
      Alexander Duyck committed
      This change moves the pos value to the other side of the "bits" field.  By
      doing this it actually simplifies a significant amount of code in the trie.
      
      For example, when halving a tree we know that the bit lost exists at
      oldnode->pos, and if we inflate the tree the new bit being added is at
      tn->pos.  Previously, to find those bits you would have to subtract pos
      and bits from the key length, or start with a value of (1 << 31) and
      then shift that.
      
      There are a number of spots throughout the code that benefit from this.  In
      the case of the hot-path searches the main advantage is that we can drop 2
      or more operations from the search path as we no longer need to compute the
      value for the index to be shifted by and can instead just use the raw pos
      value.
      
      In addition, tkey_extract_bits is now defunct and can be replaced by
      get_index, since the two operations were doing the same thing;
      get_index does it much more quickly, as it is only an xor and a shift
      versus a pair of shifts and a subtraction (see the sketch after this
      entry).
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9b44019
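
      A sketch of get_index as the message describes it, an xor followed by
      a shift; the types follow fib_trie conventions and should be treated
      as illustrative:

        static inline unsigned long get_index(t_key key, struct tnode *kv)
        {
                unsigned long index = key ^ kv->key;

                return index >> kv->pos;
        }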
    • fib_trie: Optimize fib_table_insert · 836a0123
      Alexander Duyck committed
      This patch updates the fib_table_insert function to take advantage of
      the changes made to improve the performance of fib_table_lookup.  As a
      result the code should be smaller and run faster than the original.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      836a0123
    • fib_trie: Optimize fib_find_node · 939afb06
      Alexander Duyck committed
      This patch makes use of the same features I made use of for
      fib_table_lookup to streamline fib_find_node.  The resultant code should be
      smaller and run faster than the original.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      939afb06
    • fib_trie: Optimize fib_table_lookup to avoid wasting time on loops/variables · 9f9e636d
      Alexander Duyck committed
      This patch is meant to reduce the complexity of fib_table_lookup by
      reducing the number of variables to the bare minimum while keeping the
      same, if not improved, functionality versus the original.
      
      Most of this change was started off by the desire to rid the function of
      chopped_off and current_prefix_length as they actually added very little to
      the function since they only applied when computing the cindex.  I was able
      to replace them mostly with just a check for the prefix match.  As long as
      the prefix between the key and the node being tested was the same we know
      we can search the tnode fully versus just testing cindex 0.
      
      The second portion of the change ended up being a massive reordering.
      Originally the calls to check_leaf were up near the start of the loop,
      and the backtracing and descending into lower levels of tnodes came
      later.  This didn't make much sense, as the structure of the tree means
      the leaves are always the last thing to be tested.
      instead have a loop that will delve into the tree and only exit when we
      have either found a leaf or we have exhausted the tree.  The advantage of
      rearranging things like this is that we can fully inline check_leaf since
      there is now only one reference to it in the function.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f9e636d
    • fib_trie: Merge leaf into tnode · adaf9816
      Alexander Duyck committed
      This change makes it so that leaf and tnode are the same struct.  As a
      result there is no need for rt_trie_node anymore, since everything can
      be merged into tnode.
      
      On 32b systems this results in the leaf being 4 bytes larger; however,
      I don't know if that is really an issue, as this and an earlier patch
      that added bits & pos have increased the size from 20 to 28.  If I am
      not mistaken, slub/slab allocates in power-of-2 sizes, so 20 was likely
      being rounded up to 32 anyway.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      adaf9816
    • fib_trie: Merge tnode_free and leaf_free into node_free · 37fd30f2
      Alexander Duyck committed
      Both the leaf and the tnode had an rcu_head in them, but they had them
      in slightly different places.  Since we now have them in the same spot,
      and know that any node with bits == 0 is a leaf while the rest are
      either vmalloc or kmalloc tnodes depending on the value of bits, it is
      easy to combine the functions and reduce overhead.
      
      In addition, I have taken advantage of the rcu_head pointer to put
      together a simple linked list instead of using the tnode pointer, as
      this way we can merge either type of structure for freeing.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37fd30f2
    • fib_trie: Make leaf and tnode more uniform · 64c9b6fb
      Alexander Duyck committed
      This change makes some fundamental changes to the way leaves and tnodes
      are constructed.  The big differences are:
      1.  Leaves now populate pos and bits indicating their full key size.
      2.  Trie nodes now mask out their lower bits to be consistent with the
          leaf.
      3.  Both structures have been reordered so that rt_trie_node now
          consists of a much larger region including the pos, bits, and rcu
          portions of the tnode structure.
      
      On 32b systems this will result in the leaf being 4B larger, as the pos
      and bits values were added to a hole created by the key, which was only
      4B in length.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64c9b6fb
    • fib_trie: Update usage stats to be percpu instead of global variables · 8274a97a
      Alexander Duyck committed
      The trie usage stats were being shared by all threads calling
      fib_table_lookup.  As a result, when multiple threads performed lookups
      simultaneously, the stats cacheline would begin to bounce between those
      threads.
      
      In order to prevent this I have updated the usage stats to use a set of
      percpu variables.  By doing this we should be able to avoid the cache
      bouncing and still make use of these stats (see the sketch after this
      entry).
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8274a97a
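
      A hedged sketch of the percpu counters; the struct and field names are
      illustrative:

        struct trie_use_stats {
                unsigned int gets;
                unsigned int backtrack;
                /* ... */
        };

        struct trie {
                struct tnode __rcu *trie;
                struct trie_use_stats __percpu *stats;
        };

        /* Hot path: bump this CPU's counter, no shared cacheline. */
        static void trie_stat_get_hit(struct trie *t)
        {
                this_cpu_inc(t->stats->gets);
        }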
    • gre: allow live address change · bec94d43
      Stephen Hemminger committed
      The GRE tap device supports Ethernet over GRE, but doesn't
      care about the source address of the tunnel, therefore it
      can be changed without bringing the device down (see the sketch after
      this entry).
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bec94d43
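
      A hedged sketch: the standard way to permit address changes on a
      running device is to set IFF_LIVE_ADDR_CHANGE in priv_flags at setup
      time.  The flag is real; the surrounding function is illustrative:

        static void ipgre_tap_setup(struct net_device *dev)
        {
                ether_setup(dev);
                /* address may change while the device is up */
                dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
        }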