1. 08 4月, 2019 3 次提交
    • N
      rhashtable: use bit_spin_locks to protect hash bucket. · 8f0db018
      NeilBrown 提交于
      This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
      bucket pointer to lock the hash chain for that bucket.
      
      The benefits of a bit spin_lock are:
       - no need to allocate a separate array of locks.
       - no need to have a configuration option to guide the
         choice of the size of this array
       - locking cost is often a single test-and-set in a cache line
         that will have to be loaded anyway.  When inserting at, or removing
         from, the head of the chain, the unlock is free - writing the new
         address in the bucket head implicitly clears the lock bit.
         For __rhashtable_insert_fast() we ensure this always happens
         when adding a new key.
       - even when lockings costs 2 updates (lock and unlock), they are
         in a cacheline that needs to be read anyway.
      
      The cost of using a bit spin_lock is a little bit of code complexity,
      which I think is quite manageable.
      
      Bit spin_locks are sometimes inappropriate because they are not fair -
      if multiple CPUs repeatedly contend of the same lock, one CPU can
      easily be starved.  This is not a credible situation with rhashtable.
      Multiple CPUs may want to repeatedly add or remove objects, but they
      will typically do so at different buckets, so they will attempt to
      acquire different locks.
      
      As we have more bit-locks than we previously had spinlocks (by at
      least a factor of two) we can expect slightly less contention to
      go with the slightly better cache behavior and reduced memory
      consumption.
      
      To enhance type checking, a new struct is introduced to represent the
        pointer plus lock-bit
      that is stored in the bucket-table.  This is "struct rhash_lock_head"
      and is empty.  A pointer to this needs to be cast to either an
      unsigned lock, or a "struct rhash_head *" to be useful.
      Variables of this type are most often called "bkt".
      
      Previously "pprev" would sometimes point to a bucket, and sometimes a
      ->next pointer in an rhash_head.  As these are now different types,
      pprev is NULL when it would have pointed to the bucket. In that case,
      'blk' is used, together with correct locking protocol.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f0db018
    • N
      rhashtable: allow rht_bucket_var to return NULL. · ff302db9
      NeilBrown 提交于
      Rather than returning a pointer to a static nulls, rht_bucket_var()
      now returns NULL if the bucket doesn't exist.
      This will make the next patch, which stores a bitlock in the
      bucket pointer, somewhat cleaner.
      
      This change involves introducing __rht_bucket_nested() which is
      like rht_bucket_nested(), but doesn't provide the static nulls,
      and changing rht_bucket_nested() to call this and possible
      provide a static nulls - as is still needed for the non-var case.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff302db9
    • N
      rhashtable: use cmpxchg() in nested_table_alloc() · 7a41c294
      NeilBrown 提交于
      nested_table_alloc() relies on the fact that there is
      at most one spinlock allocated for every slot in the top
      level nested table, so it is not possible for two threads
      to try to allocate the same table at the same time.
      
      This assumption is a little fragile (it is not explicit) and is
      unnecessary as cmpxchg() can be used instead.
      
      A future patch will replace the spinlocks by per-bucket bitlocks,
      and then we won't be able to protect the slot pointer with a spinlock.
      
      So replace rcu_assign_pointer() with cmpxchg() - which has equivalent
      barrier properties.
      If it the cmp fails, free the table that was just allocated.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a41c294
  2. 22 3月, 2019 3 次提交
    • N
      rhashtable: rename rht_for_each*continue as *from. · f7ad68bf
      NeilBrown 提交于
      The pattern set by list.h is that for_each..continue()
      iterators start at the next entry after the given one,
      while for_each..from() iterators start at the given
      entry.
      
      The rht_for_each*continue() iterators are documented as though the
      start at the 'next' entry, but actually start at the given entry,
      and they are used expecting that behaviour.
      So fix the documentation and change the names to *from for consistency
      with list.h
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Acked-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7ad68bf
    • N
      rhashtable: don't hold lock on first table throughout insertion. · 4feb7c7a
      NeilBrown 提交于
      rhashtable_try_insert() currently holds a lock on the bucket in
      the first table, while also locking buckets in subsequent tables.
      This is unnecessary and looks like a hold-over from some earlier
      version of the implementation.
      
      As insert and remove always lock a bucket in each table in turn, and
      as insert only inserts in the final table, there cannot be any races
      that are not covered by simply locking a bucket in each table in turn.
      
      When an insert call reaches that last table it can be sure that there
      is no matchinf entry in any other table as it has searched them all, and
      insertion never happens anywhere but in the last table.  The fact that
      code tests for the existence of future_tbl while holding a lock on
      the relevant bucket ensures that two threads inserting the same key
      will make compatible decisions about which is the "last" table.
      
      This simplifies the code and allows the ->rehash field to be
      discarded.
      
      We still need a way to ensure that a dead bucket_table is never
      re-linked by rhashtable_walk_stop().  This can be achieved by calling
      call_rcu() inside the locked region, and checking with
      rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
      bucket table is empty and dead.
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4feb7c7a
    • H
      rhashtable: Still do rehash when we get EEXIST · 408f13ef
      Herbert Xu 提交于
      As it stands if a shrink is delayed because of an outstanding
      rehash, we will go into a rescheduling loop without ever doing
      the rehash.
      
      This patch fixes this by still carrying out the rehash and then
      rescheduling so that we can shrink after the completion of the
      rehash should it still be necessary.
      
      The return value of EEXIST captures this case and other cases
      (e.g., another thread expanded/rehashed the table at the same
      time) where we should still proceed with the rehash.
      
      Fixes: da20420f ("rhashtable: Add nested tables")
      Reported-by: NJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: NJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      408f13ef
  3. 22 2月, 2019 1 次提交
  4. 04 12月, 2018 1 次提交
    • N
      rhashtable: detect when object movement between tables might have invalidated a lookup · 82208d0d
      NeilBrown 提交于
      Some users of rhashtables might need to move an object from one table
      to another -  this appears to be the reason for the incomplete usage
      of NULLS markers.
      
      To support these, we store a unique NULLS_MARKER at the end of
      each chain, and when a search fails to find a match, we check
      if the NULLS marker found was the expected one.  If not, the search
      may not have examined all objects in the target bucket, so it is
      repeated.
      
      The unique NULLS_MARKER is derived from the address of the
      head of the chain.  As this cannot be derived at load-time the
      static rhnull in rht_bucket_nested() needs to be initialised
      at run time.
      
      Any caller of a lookup function must still be prepared for the
      possibility that the object returned is in a different table - it
      might have been there for some time.
      
      Note that this does NOT provide support for other uses of
      NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
      the key of an object and re-inserting it in the same table.
      These could only be done safely if new objects were inserted
      at the *start* of a hash chain, and that is not currently the case.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82208d0d
  5. 23 8月, 2018 2 次提交
  6. 21 8月, 2018 1 次提交
  7. 19 7月, 2018 1 次提交
  8. 10 7月, 2018 1 次提交
    • T
      rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo 提交于
      rhashtable_free_and_destroy() cancels re-hash deferred work
      then walks and destroys elements. at this moment, some elements can be
      still in future_tbl. that elements are not destroyed.
      
      test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl.
      so that splat occurred.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		}
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables requested by Herbert Xu
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0026129c
  9. 03 7月, 2018 1 次提交
  10. 22 6月, 2018 6 次提交
  11. 25 4月, 2018 3 次提交
  12. 01 4月, 2018 1 次提交
  13. 07 3月, 2018 1 次提交
  14. 11 12月, 2017 3 次提交
    • T
      rhashtable: Call library function alloc_bucket_locks · 64e0cd0d
      Tom Herbert 提交于
      To allocate the array of bucket locks for the hash table we now
      call library function alloc_bucket_spinlocks. This function is
      based on the old alloc_bucket_locks in rhashtable and should
      produce the same effect.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64e0cd0d
    • T
      rhashtable: Add rhastable_walk_peek · 2db54b47
      Tom Herbert 提交于
      This function is like rhashtable_walk_next except that it only returns
      the current element in the inter and does not advance the iter.
      
      This patch also creates __rhashtable_walk_find_next. It finds the next
      element in the table when the entry cached in iter is NULL or at the end
      of a slot. __rhashtable_walk_find_next is called from
      rhashtable_walk_next and rhastable_walk_peek.
      
      end_of_table is an added field to the iter structure. This indicates
      that the end of table was reached (walker.tbl being NULL is not a
      sufficient condition for end of table).
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2db54b47
    • T
      rhashtable: Change rhashtable_walk_start to return void · 97a6ec4a
      Tom Herbert 提交于
      Most callers of rhashtable_walk_start don't care about a resize event
      which is indicated by a return value of -EAGAIN. So calls to
      rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
      like this is common:
      
             ret = rhashtable_walk_start(rhiter);
             if (ret && ret != -EAGAIN)
                     goto out;
      
      Since zero and -EAGAIN are the only possible return values from the
      function this check is pointless. The condition never evaluates to true.
      
      This patch changes rhashtable_walk_start to return void. This simplifies
      code for the callers that ignore -EAGAIN. For the few cases where the
      caller cares about the resize event, particularly where the table can be
      walked in mulitple parts for netlink or seq file dump, the function
      rhashtable_walk_start_check has been added that returns -EAGAIN on a
      resize event.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97a6ec4a
  15. 20 9月, 2017 1 次提交
  16. 11 7月, 2017 1 次提交
  17. 20 6月, 2017 1 次提交
  18. 09 5月, 2017 1 次提交
  19. 02 5月, 2017 1 次提交
  20. 28 4月, 2017 1 次提交
  21. 27 4月, 2017 2 次提交
  22. 19 4月, 2017 1 次提交
  23. 02 3月, 2017 1 次提交
  24. 27 2月, 2017 2 次提交