1. 29 Jul 2020 (1 commit)
  2. 07 Jun 2020 (1 commit)
    • rhashtable: Drop raw RCU deref in nested_table_free · 4a3084aa
      Herbert Xu committed
      This patch replaces some unnecessary uses of rcu_dereference_raw
      in the rhashtable code with rcu_dereference_protected.
      
      The top-level nested table entry is only marked as RCU because it
      shares the same type as the tree entries underneath it.  So it
      doesn't need any RCU protection.
      
      We also don't need RCU protection when we're freeing a nested RCU
      table because by this stage we've long passed a memory barrier
      when anyone could change the nested table.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
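
      For context, a minimal sketch of the distinction (illustrative only,
      not the rhashtable code; the struct, field and lock names here are
      made up):

      #include <linux/mutex.h>
      #include <linux/rcupdate.h>

      struct cfg { int val; };

      struct holder {
              struct mutex lock;      /* guards updates to p */
              struct cfg __rcu *p;
      };

      static int holder_read_locked(struct holder *h)
      {
              struct cfg *c;

              /* The update-side mutex is held, so no RCU read lock is
               * needed; the condition documents why the plain load is
               * safe and lets lockdep verify it.
               */
              c = rcu_dereference_protected(h->p, lockdep_is_held(&h->lock));
              return c ? c->val : 0;
      }

      rcu_dereference_raw() performs the same load but suppresses all
      checking, which is why it is best reserved for cases where no
      meaningful protection condition can be stated.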
  3. 19 Jun 2019 (1 commit)
  4. 17 May 2019 (2 commits)
  5. 13 Apr 2019 (5 commits)
    • rhashtable: use BIT(0) for locking. · ca0b709d
      NeilBrown committed
      As reported by Guenter Roeck, the new bit-locking using
      BIT(1) doesn't work on the m68k architecture.  m68k only requires
      2-byte alignment for words and longwords, so there is only one
      unused bit in pointers to structs - we currently use two: one for the
      NULLS marker at the end of the linked list, and one for the bit-lock
      in the head of the list.
      
      The two uses don't need to conflict as we never need the head of the
      list to be a NULLS marker - the marker is only needed to check if an
      object has moved to a different table, and the bucket head cannot
      move.  The NULLS marker is only needed in a ->next pointer.
      
      As we already have different types for the bucket head pointer (struct
      rhash_lock_head) and the ->next pointers (struct rhash_head), it is
      fairly easy to treat the lsb differently in each.
      
      So: Initialize buckets heads to NULL, and use the lsb for locking.
      When loading the pointer from the bucket head, if it is NULL (ignoring
      the lock bit), report it as being the expected NULLS marker.
      When storing a value into a bucket head, if it is a NULLS marker,
      store NULL instead.
      
      And convert all places that used bit 1 for locking, to use bit 0.
      
      Fixes: 8f0db018 ("rhashtable: use bit_spin_locks to protect hash bucket.")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
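
      A simplified, userspace-style sketch of the encoding described above
      (hypothetical helper names; the real accessors are rht_ptr() and
      friends in include/linux/rhashtable.h):

      #include <stdint.h>

      #define LOCK_BIT 1UL   /* BIT(0): per-bucket lock, only in the bucket head */

      /* NULLS marker expected at the end of this bucket's chain: derived
       * from the bucket address and always odd, so it can never be
       * mistaken for a (word-aligned) object pointer.
       */
      static inline uintptr_t expected_nulls(uintptr_t *bucket)
      {
              return (uintptr_t)bucket | 1UL;
      }

      /* Load a bucket head: strip the lock bit, and report a NULL head as
       * the expected NULLS marker so chain walkers see a consistent end.
       */
      static inline uintptr_t head_load(uintptr_t *bucket)
      {
              uintptr_t v = *bucket & ~LOCK_BIT;

              return v ? v : expected_nulls(bucket);
      }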
    • rhashtable: replace rht_ptr_locked() with rht_assign_locked() · f4712b46
      NeilBrown committed
      rht_ptr_locked() is only ever used to store a new value in a
      bucket head, and that is the only situation where it makes sense
      to use it.  So replace it with a function that does the whole
      task: set the lock bit and assign to the bucket head.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
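
      Continuing the sketch above, the assign-side helper the patch
      describes looks roughly like this (hypothetical name; the real helper
      is rht_assign_locked(), which uses an RCU release store):

      /* Store a new head while keeping the bucket locked: the caller
       * already holds the bit lock, and the stored value carries the lock
       * bit so the bucket stays locked after the assignment.
       */
      static inline void head_assign_locked(uintptr_t *bucket, uintptr_t obj)
      {
              *bucket = obj | LOCK_BIT;
      }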
    • rhashtable: move dereference inside rht_ptr() · adc6a3ab
      NeilBrown committed
      Rather than dereferencing a pointer to a bucket and then passing the
      result to rht_ptr(), we now pass in the pointer and do the dereference
      in rht_ptr().
      
      This requires that we pass in the tbl and hash as well to support RCU
      checks, and means that the various rht_for_each functions can expect a
      pointer that can be dereferenced without further care.
      
      There are two places where we dereference a bucket pointer
      where there is no testable protection - in each case we know
      that we must have exclusive access without having taken a lock.
      The previous code used rht_dereference() to pretend that holding
      the mutex provided protection, but holding the mutex never provides
      protection for accessing buckets.
      
      So instead introduce rht_ptr_exclusive() that can be used when
      there is known to be exclusive access without holding any locks.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
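
      Roughly, the two accessors after this change have the following
      shape (a hedged sketch of the interface, not the exact kernel code):

      /* Checked read: takes the address of the bucket slot plus the table
       * and hash so the RCU/lock conditions can be verified, strips the
       * lock bit, and substitutes the bucket's NULLS marker for a NULL
       * head, so callers always get something they can walk from.
       */
      struct rhash_head *rht_ptr(struct rhash_lock_head __rcu * const *bkt,
                                 struct bucket_table *tbl,
                                 unsigned int hash);

      /* For the call sites with exclusive access but nothing testable to
       * assert (no lock held, no RCU condition).
       */
      struct rhash_head *rht_ptr_exclusive(struct rhash_lock_head __rcu * const *bkt);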
    • rhashtable: fix some __rcu annotation errors · e4edbe3c
      NeilBrown committed
      With these annotations, the rhashtable now gets no
      warnings when compiled with "C=1" for sparse checking.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
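
      For context, a tiny illustration of the kind of annotation sparse
      ("make C=1") checks for (illustrative struct, not taken from
      rhashtable):

      struct node {
              int key;
              struct node __rcu *next;   /* must go through RCU accessors */
      };

      static struct node *node_next(struct node *n)
      {
              /* A direct "return n->next;" would make sparse warn about an
               * address-space mismatch; rcu_dereference() (or a _protected
               * / _check variant on the update side) keeps the build
               * sparse-clean.
               */
              return rcu_dereference(n->next);
      }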
    • rhashtable: use struct_size() in kvzalloc() · c252aa3e
      Gustavo A. R. Silva committed
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along with
      memory for some number of elements for that array.  For example:
      
      struct foo {
          int stuff;
          struct boo entry[];
      };
      
      size = sizeof(struct foo) + count * sizeof(struct boo);
      instance = kvzalloc(size, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 08 Apr 2019 (4 commits)
    • rhashtable: add lockdep tracking to bucket bit-spin-locks. · 149212f0
      NeilBrown committed
      Native bit_spin_locks are not tracked by lockdep.
      
      The bit_spin_locks used for rhashtable buckets are local
      to the rhashtable implementation, so there is little opportunity
      for the sort of misuse that lockdep might detect.
      However locks are held while a hash function or compare
      function is called, and if one of these took a lock,
      a misbehaviour is possible.
      
      As it is quite easy to add lockdep support this unlikely
      possibility seems to be enough justification.
      
      So create a lockdep class for bucket bit_spin_lock and attach
      through a lockdep_map in each bucket_table.
      
      Without the 'nested' annotation in rhashtable_rehash_one(), lockdep
      correctly reports a possible problem as this lock is taken
      while another bucket lock (in another table) is held.  This
      confirms that the added support works.
      With the correct nested annotation in place, lockdep reports
      no problems.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
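
      A hedged sketch of the pattern (class-key and field names are
      illustrative; the real code is in lib/rhashtable.c):

      static struct lock_class_key bkt_lock_key;

      /* at bucket_table allocation time */
      lockdep_init_map(&tbl->dep_map, "rhashtable_bucket", &bkt_lock_key, 0);

      /* around the per-bucket bit lock */
      bit_spin_lock(0, (unsigned long *)bkt);
      lock_map_acquire(&tbl->dep_map);
      /* ... modify the chain ... */
      lock_map_release(&tbl->dep_map);
      bit_spin_unlock(0, (unsigned long *)bkt);

      The rehash path acquires the lock on the new table's bucket with a
      nested annotation, since it already holds a bucket lock in the old
      table.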
    • rhashtable: use bit_spin_locks to protect hash bucket. · 8f0db018
      NeilBrown committed
      This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
      bucket pointer to lock the hash chain for that bucket.
      
      The benefits of a bit spin_lock are:
       - no need to allocate a separate array of locks.
       - no need to have a configuration option to guide the
         choice of the size of this array
       - locking cost is often a single test-and-set in a cache line
         that will have to be loaded anyway.  When inserting at, or removing
         from, the head of the chain, the unlock is free - writing the new
         address in the bucket head implicitly clears the lock bit.
         For __rhashtable_insert_fast() we ensure this always happens
         when adding a new key.
       - even when locking costs 2 updates (lock and unlock), they are
         in a cacheline that needs to be read anyway.
      
      The cost of using a bit spin_lock is a little bit of code complexity,
      which I think is quite manageable.
      
      Bit spin_locks are sometimes inappropriate because they are not fair -
      if multiple CPUs repeatedly contend on the same lock, one CPU can
      easily be starved.  This is not a credible situation with rhashtable.
      Multiple CPUs may want to repeatedly add or remove objects, but they
      will typically do so at different buckets, so they will attempt to
      acquire different locks.
      
      As we have more bit-locks than we previously had spinlocks (by at
      least a factor of two) we can expect slightly less contention to
      go with the slightly better cache behavior and reduced memory
      consumption.
      
      To enhance type checking, a new struct is introduced to represent the
      pointer plus lock-bit that is stored in the bucket-table.  This is
      "struct rhash_lock_head" and is empty.  A pointer to this needs to be
      cast to either an unsigned long, or a "struct rhash_head *", to be
      useful.  Variables of this type are most often called "bkt".

      Previously "pprev" would sometimes point to a bucket, and sometimes a
      ->next pointer in an rhash_head.  As these are now different types,
      pprev is NULL when it would have pointed to the bucket. In that case,
      'bkt' is used, together with the correct locking protocol.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
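
      A simplified sketch of the "unlock for free" insert-at-head path
      described above, reusing the bit-0 convention from the BIT(0) sketch
      earlier on this page (this commit originally used BIT(1)); names are
      hypothetical:

      struct entry {
              uintptr_t next;          /* next object or NULLS marker */
              int key;
      };

      /* Insert obj at the head of a bucket whose bit lock we hold.
       * Writing the new head with the lock bit clear both publishes the
       * object and releases the lock in one store; in the kernel this is
       * a release store such as rcu_assign_pointer().
       */
      static inline void insert_head_and_unlock(uintptr_t *bucket,
                                                struct entry *obj)
      {
              obj->next = head_load(bucket);   /* old head or NULLS marker */
              *bucket = (uintptr_t)obj;        /* lock bit clear => unlocked */
      }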
    • rhashtable: allow rht_bucket_var to return NULL. · ff302db9
      NeilBrown committed
      Rather than returning a pointer to a static nulls, rht_bucket_var()
      now returns NULL if the bucket doesn't exist.
      This will make the next patch, which stores a bitlock in the
      bucket pointer, somewhat cleaner.
      
      This change involves introducing __rht_bucket_nested(), which is
      like rht_bucket_nested() but doesn't provide the static nulls,
      and changing rht_bucket_nested() to call this and possibly
      provide a static nulls - as is still needed for the non-var case.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: use cmpxchg() in nested_table_alloc() · 7a41c294
      NeilBrown committed
      nested_table_alloc() relies on the fact that there is
      at most one spinlock allocated for every slot in the top
      level nested table, so it is not possible for two threads
      to try to allocate the same table at the same time.
      
      This assumption is a little fragile (it is not explicit) and is
      unnecessary as cmpxchg() can be used instead.
      
      A future patch will replace the spinlocks by per-bucket bitlocks,
      and then we won't be able to protect the slot pointer with a spinlock.
      
      So replace rcu_assign_pointer() with cmpxchg() - which has equivalent
      barrier properties.
      If the cmpxchg fails, free the table that was just allocated.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
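
      The allocate-then-publish pattern described above, in simplified
      form (the slot type and function name are illustrative, not the
      exact nested_table_alloc() code):

      static union nested_table *nested_slot_get(union nested_table **slot,
                                                 size_t size)
      {
              union nested_table *ntbl, *old;

              ntbl = kzalloc(size, GFP_ATOMIC);
              if (!ntbl)
                      return NULL;

              /* Publish only if the slot is still empty; cmpxchg() gives
               * the same ordering guarantees rcu_assign_pointer() did.
               */
              old = cmpxchg(slot, NULL, ntbl);
              if (old) {
                      /* Lost the race: use the winner's table, free ours. */
                      kfree(ntbl);
                      ntbl = old;
              }
              return ntbl;
      }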
  7. 22 Mar 2019 (3 commits)
    • rhashtable: rename rht_for_each*continue as *from. · f7ad68bf
      NeilBrown committed
      The pattern set by list.h is that for_each..continue()
      iterators start at the next entry after the given one,
      while for_each..from() iterators start at the given
      entry.
      
      The rht_for_each*continue() iterators are documented as though they
      start at the 'next' entry, but actually start at the given entry,
      and they are used expecting that behaviour.
      So fix the documentation and change the names to *from for consistency
      with list.h
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
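
      For reference, a minimal illustration of the list.h convention the
      rename follows (struct and field names are made up):

      struct item {
              int key;
              struct list_head list;
      };

      static void print_from(struct item *pos, struct list_head *head)
      {
              /* list_for_each_entry_from() starts AT pos itself, which is
               * the behaviour rht_for_each*continue() actually had, hence
               * the rename to rht_for_each*from().  The ..._continue()
               * variant would start at the entry after pos.
               */
              list_for_each_entry_from(pos, head, list)
                      pr_info("key=%d\n", pos->key);
      }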
    • rhashtable: don't hold lock on first table throughout insertion. · 4feb7c7a
      NeilBrown committed
      rhashtable_try_insert() currently holds a lock on the bucket in
      the first table, while also locking buckets in subsequent tables.
      This is unnecessary and looks like a hold-over from some earlier
      version of the implementation.
      
      As insert and remove always lock a bucket in each table in turn, and
      as insert only inserts in the final table, there cannot be any races
      that are not covered by simply locking a bucket in each table in turn.
      
      When an insert call reaches that last table it can be sure that there
      is no matching entry in any other table as it has searched them all, and
      insertion never happens anywhere but in the last table.  The fact that
      code tests for the existence of future_tbl while holding a lock on
      the relevant bucket ensures that two threads inserting the same key
      will make compatible decisions about which is the "last" table.
      
      This simplifies the code and allows the ->rehash field to be
      discarded.
      
      We still need a way to ensure that a dead bucket_table is never
      re-linked by rhashtable_walk_stop().  This can be achieved by calling
      call_rcu() inside the locked region, and checking with
      rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
      bucket table is empty and dead.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
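
      A hedged sketch of the shape of that check (simplified from the
      description above; not the verbatim kernel code):

      /* Rehash path, inside the locked region: hand the old table to RCU. */
      call_rcu(&old_tbl->rcu, bucket_table_free_rcu);

      /* rhashtable_walk_stop(): only re-link the walker if the table has
       * not already been passed to call_rcu() above.
       */
      if (rcu_head_after_call_rcu(&tbl->rcu, bucket_table_free_rcu))
              iter->walker.tbl = NULL;        /* table is dead, don't re-link */
      else
              list_add(&iter->walker.list, &tbl->walkers);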
    • rhashtable: Still do rehash when we get EEXIST · 408f13ef
      Herbert Xu committed
      As it stands if a shrink is delayed because of an outstanding
      rehash, we will go into a rescheduling loop without ever doing
      the rehash.
      
      This patch fixes this by still carrying out the rehash and then
      rescheduling so that we can shrink after the completion of the
      rehash should it still be necessary.
      
      The return value of EEXIST captures this case and other cases
      (e.g., another thread expanded/rehashed the table at the same
      time) where we should still proceed with the rehash.
      
      Fixes: da20420f ("rhashtable: Add nested tables")
      Reported-by: Josh Elsasser <jelsasser@appneta.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Tested-by: Josh Elsasser <jelsasser@appneta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 Feb 2019 (1 commit)
  9. 04 Dec 2018 (1 commit)
    • rhashtable: detect when object movement between tables might have invalidated a lookup · 82208d0d
      NeilBrown committed
      Some users of rhashtables might need to move an object from one table
      to another -  this appears to be the reason for the incomplete usage
      of NULLS markers.
      
      To support these, we store a unique NULLS_MARKER at the end of
      each chain, and when a search fails to find a match, we check
      if the NULLS marker found was the expected one.  If not, the search
      may not have examined all objects in the target bucket, so it is
      repeated.
      
      The unique NULLS_MARKER is derived from the address of the
      head of the chain.  As this cannot be derived at load-time the
      static rhnull in rht_bucket_nested() needs to be initialised
      at run time.
      
      Any caller of a lookup function must still be prepared for the
      possibility that the object returned is in a different table - it
      might have been there for some time.
      
      Note that this does NOT provide support for other uses of
      NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
      the key of an object and re-inserting it in the same table.
      These could only be done safely if new objects were inserted
      at the *start* of a hash chain, and that is not currently the case.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
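
      A simplified sketch of the retry, continuing the userspace-style
      lsb-encoding example used earlier on this page (hypothetical helpers,
      not the kernel's __rhashtable_lookup()):

      static struct entry *lookup(uintptr_t *bucket, int key)
      {
              uintptr_t p;

      restart:
              for (p = head_load(bucket); !(p & 1UL); p = ((struct entry *)p)->next)
                      if (((struct entry *)p)->key == key)
                              return (struct entry *)p;

              /* We stopped on a NULLS marker (odd value).  If it is not
               * the marker derived from THIS bucket, an object may have
               * moved to another table mid-walk and the scan could be
               * incomplete, so repeat it.
               */
              if (p != expected_nulls(bucket))
                      goto restart;
              return NULL;
      }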
  10. 23 Aug 2018 (2 commits)
  11. 21 Aug 2018 (1 commit)
  12. 19 Jul 2018 (1 commit)
  13. 10 Jul 2018 (1 commit)
    • rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo committed
      rhashtable_free_and_destroy() cancels the deferred rehash work and
      then walks the table and destroys elements.  At that point, some
      elements can still be in future_tbl, and those elements are not
      destroyed.

      test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy elements of
      future_tbl, so the splat below occurred.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		}
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables, as requested by Herbert Xu
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
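
      A hedged sketch of the control flow after this fix (simplified; the
      real loop is in rhashtable_free_and_destroy()):

      cancel_work_sync(&ht->run_work);
      mutex_lock(&ht->mutex);
      tbl = rht_dereference(ht->tbl, ht);
      restart:
      if (free_fn) {
              for (i = 0; i < tbl->size; i++) {
                      /* walk chain i, calling free_fn() on each element */
              }
      }
      next_tbl = rht_dereference(tbl->future_tbl, ht);
      bucket_table_free(tbl);
      if (next_tbl) {
              tbl = next_tbl;
              goto restart;   /* also free elements parked in future_tbl */
      }
      mutex_unlock(&ht->mutex);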
  14. 03 Jul 2018 (1 commit)
  15. 22 Jun 2018 (6 commits)
  16. 25 Apr 2018 (3 commits)
  17. 01 Apr 2018 (1 commit)
  18. 07 Mar 2018 (1 commit)
  19. 11 Dec 2017 (3 commits)
    • rhashtable: Call library function alloc_bucket_locks · 64e0cd0d
      Tom Herbert committed
      To allocate the array of bucket locks for the hash table we now
      call the library function alloc_bucket_spinlocks. This function is
      based on the old alloc_bucket_locks in rhashtable and should
      produce the same effect.
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
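
      For reference, a hedged sketch of how the helper is called (argument
      order as in lib/bucket_locks.c at the time; treat as illustrative):

      spinlock_t *locks;
      unsigned int locks_mask;
      int err;

      /* Allocates a power-of-two sized array of spinlocks bounded by
       * max_size and scaled by cpu_mult locks per CPU, and returns the
       * mask used to map a hash value onto a lock.
       */
      err = alloc_bucket_spinlocks(&locks, &locks_mask, max_size, cpu_mult, gfp);
      if (err)
              return err;

      /* ... use locks[hash & locks_mask] ... */

      free_bucket_spinlocks(locks);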
    • rhashtable: Add rhashtable_walk_peek · 2db54b47
      Tom Herbert committed
      This function is like rhashtable_walk_next except that it only returns
      the current element in the iter and does not advance the iter.

      This patch also creates __rhashtable_walk_find_next. It finds the next
      element in the table when the entry cached in iter is NULL or at the end
      of a slot. __rhashtable_walk_find_next is called from
      rhashtable_walk_next and rhashtable_walk_peek.
      
      end_of_table is an added field to the iter structure. This indicates
      that the end of table was reached (walker.tbl being NULL is not a
      sufficient condition for end of table).
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
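
      A hedged usage sketch (the table, iterator and element type are
      assumed to be set up elsewhere; "struct obj" is illustrative):

      struct rhashtable_iter iter;
      struct obj *p;

      rhashtable_walk_enter(&ht, &iter);
      rhashtable_walk_start(&iter);

      /* Peek at the next element without advancing the iterator... */
      p = rhashtable_walk_peek(&iter);

      /* ...then consume it: rhashtable_walk_next() returns the same
       * element and advances.  Both return NULL at the end of the table
       * or ERR_PTR(-EAGAIN) if a resize intervened.
       */
      p = rhashtable_walk_next(&iter);

      rhashtable_walk_stop(&iter);
      rhashtable_walk_exit(&iter);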
    • rhashtable: Change rhashtable_walk_start to return void · 97a6ec4a
      Tom Herbert committed
      Most callers of rhashtable_walk_start don't care about a resize event
      which is indicated by a return value of -EAGAIN. So calls to
      rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something
      like this is common:
      
             ret = rhashtable_walk_start(rhiter);
             if (ret && ret != -EAGAIN)
                     goto out;
      
      Since zero and -EAGAIN are the only possible return values from the
      function this check is pointless. The condition never evaluates to true.
      
      This patch changes rhashtable_walk_start to return void. This simplifies
      code for the callers that ignore -EAGAIN. For the few cases where the
      caller cares about the resize event, particularly where the table can be
      walked in multiple parts for a netlink or seq file dump, the function
      rhashtable_walk_start_check has been added that returns -EAGAIN on a
      resize event.
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
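
      A hedged sketch of the two calling styles after this change
      ("handle()" and the element type are illustrative):

      /* Callers that don't care about resize events: */
      rhashtable_walk_enter(&ht, &iter);
      rhashtable_walk_start(&iter);           /* now returns void */
      while ((p = rhashtable_walk_next(&iter)) != NULL) {
              if (IS_ERR(p))
                      continue;   /* ERR_PTR(-EAGAIN): table resized, keep walking */
              handle(p);
      }
      rhashtable_walk_stop(&iter);
      rhashtable_walk_exit(&iter);

      /* Callers doing a multi-part dump (netlink, seq_file) that must
       * notice a resize and restart:
       */
      err = rhashtable_walk_start_check(&iter);
      if (err == -EAGAIN) {
              /* the table was resized since the previous chunk; restart */
      }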
  20. 20 Sep 2017 (1 commit)