1. 05 March 2020, 2 commits
  2. 03 March 2020, 2 commits
    • bpf: Introduce pinnable bpf_link abstraction · 70ed506c
      Andrii Nakryiko authored
      Introduce a bpf_link abstraction, representing an attachment of a BPF program
      to a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
      ownership of the attached BPF program and reference counting of the link
      itself (a link can be referenced from multiple anonymous inodes), and it
      ensures that the release callback is called from process context, so that
      users can safely take mutex locks and sleep.
      
      Additionally, the new abstraction makes it possible to generalize pinning of
      a link object in BPF FS: pinning a link explicitly prevents BPF program
      detachment on process exit, and an independent process can open the pinned
      link to keep working with it.
      
      Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
      program attachments) to the bpf_link framework, making them pinnable in
      BPF FS. More FD-based bpf_links will be added in follow-up patches.
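      
      A minimal sketch of the shape of this abstraction (hedged: the field set
      follows the description above and common kernel idioms, not necessarily
      the merged code):
      
      #include <linux/atomic.h>
      #include <linux/workqueue.h>
      
      struct bpf_prog;
      struct bpf_link;
      
      /* Per-link-type callbacks; release is deferred via the work item
       * below so it always runs in process context and may sleep. */
      struct bpf_link_ops {
              void (*release)(struct bpf_link *link);
              void (*dealloc)(struct bpf_link *link);
      };
      
      struct bpf_link {
              atomic64_t refcnt;              /* the link's own refcount; it can
                                               * be held by several anon inodes */
              const struct bpf_link_ops *ops;
              struct bpf_prog *prog;          /* owned ref to attached program */
              struct work_struct work;        /* defers release to process ctx */
      };
      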
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
      70ed506c
    • netdevice: Replace zero-length array with flexible-array member · bb4cf02d
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] into the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
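      
      As a standalone illustration (not code from this patch), an allocation
      is written the same way with the flexible-array form:
      
      #include <stdio.h>
      #include <stdlib.h>
      
      struct foo {
              int stuff;
              int array[];    /* flexible array member, C99 */
      };
      
      int main(void)
      {
              size_t n = 4;
              /* sizeof(struct foo) covers only the fixed part; the
               * flexible array contributes nothing, matching the old
               * zero-length-array quirk, so allocations are unchanged. */
              struct foo *f = malloc(sizeof(*f) + n * sizeof(f->array[0]));
      
              if (!f)
                      return 1;
              printf("sizeof(struct foo) = %zu\n", sizeof(struct foo));
              free(f);
              return 0;
      }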
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb4cf02d
  3. 01 March 2020, 1 commit
  4. 29 February 2020, 2 commits
  5. 28 February 2020, 7 commits
    • bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump() · 085c20ca
      Martin KaFai Lau authored
      This patch dumps the bpf_sk_storages of a sk
      when the request carries the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.
      
      An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
      INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
      If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.
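      
      For illustration, a hedged userspace sketch of appending that request
      nest by hand (the attribute names come from the UAPI additions in this
      series; buffer-size checking is omitted for brevity):
      
      #include <string.h>
      #include <linux/netlink.h>
      #include <linux/inet_diag.h>
      #include <linux/sock_diag.h>
      
      /* Append INET_DIAG_REQ_SK_BPF_STORAGES with one
       * SK_DIAG_BPF_STORAGE_REQ_MAP_FD per requested map; an empty nest
       * selects all bpf_sk_storages. Returns the bytes appended. */
      static int put_sk_storage_req(char *buf, const int *map_fds, int n)
      {
              struct nlattr *nest = (struct nlattr *)buf;
              int len = NLA_HDRLEN, i;
      
              nest->nla_type = NLA_F_NESTED | INET_DIAG_REQ_SK_BPF_STORAGES;
              for (i = 0; i < n; i++) {
                      struct nlattr *a = (struct nlattr *)(buf + len);
      
                      a->nla_type = SK_DIAG_BPF_STORAGE_REQ_MAP_FD;
                      a->nla_len = NLA_HDRLEN + sizeof(int);
                      memcpy((char *)a + NLA_HDRLEN, &map_fds[i], sizeof(int));
                      len += NLA_ALIGN(a->nla_len);
              }
              nest->nla_len = len;
              return NLA_ALIGN(len);
      }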
      
      bpf_sk_storages can be added to the system at runtime.  It is difficult
      to find a proper static value for cb->min_dump_alloc.
      
      This patch learns the nlattr size required to dump the bpf_sk_storages
      of a sk.  If that happens in the very first nlmsg of a dump and the
      needed bpf_sk_storages cannot fit, it will try to expand the skb
      with pskb_expand_head().
      
      Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
      sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
      be used.  In __inet_diag_dump(), it will retry as long as the
      skb is empty and the cb->min_dump_alloc becomes larger than before.
      cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
      is also changed from 'u16' to 'u32' to accommodate a sk that may have
      a few large bpf_sk_storages.
      
      The updated cb->min_dump_alloc will also be used to allocate the skb in
      the next dump.  This logic already exists in netlink_dump().
      
      Here is sample output from a locally modified 'ss'; it could be made
      more readable later by using BTF:
      [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
      State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
      ESTAB 0      0              [::1]:51072        [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      ESTAB 0      0              [::1]:51070        [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      
      [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
      State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
      ESTAB         0              0                                [::1]:51072                          [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
      ESTAB         0              0                                [::1]:51070                          [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
      085c20ca
    • bpf: INET_DIAG support in bpf_sk_storage · 1ed4d924
      Martin KaFai Lau authored
      This patch adds INET_DIAG support to bpf_sk_storage.
      
      1. Although this series adds the bpf_sk_storage diag capability to inet sk,
         bpf_sk_storage is in general applicable to all fullsocks.  Hence, the
         bpf_sk_storage logic operates on SK_DIAG_* nlattrs.  The caller
         passes in its specific nesting nlattr (e.g. INET_DIAG_*) as
         the argument.
      
      2. The request will be like:
      	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in a later patch)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		......
      
         Considering there could be multiple bpf_sk_storages in a sk,
         instead of reusing INET_DIAG_INFO ("ss -i"),  the user can select
         specific bpf_sk_storages to dump by specifying an array of
         SK_DIAG_BPF_STORAGE_REQ_MAP_FD.
      
         If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
         INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
         of a sk.
      
      3. The reply will be like (a parsing sketch follows this list):
      	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in a later patch)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		......
      
      4. Unlike other INET_DIAG info of a sk, which is pretty static, the size
         required to dump the bpf_sk_storage(s) of a sk is dynamic, growing as
         the system adds more bpf_sk_storage_maps.  It is hard to set a static
         min_dump_alloc size.
      
         Hence, this series learns it at runtime and adjusts the
         cb->min_dump_alloc as it iterates over all sk(s) of a system.  The
         "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
         is for this purpose.
      
         The next patch will update the cb->min_dump_alloc as it
         iterates the sk(s).
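      
      A hedged userspace sketch of walking the reply nest from point 3 above
      (attribute names from this series' UAPI headers; hand-rolled iteration
      with minimal validation):
      
      #include <stdio.h>
      #include <string.h>
      #include <linux/netlink.h>
      #include <linux/inet_diag.h>
      #include <linux/sock_diag.h>
      
      static const void *nla_body(const struct nlattr *a)
      {
              return (const char *)a + NLA_HDRLEN;
      }
      
      static const struct nlattr *nla_advance(const struct nlattr *a, int *rem)
      {
              *rem -= NLA_ALIGN(a->nla_len);
              return (const struct nlattr *)((const char *)a + NLA_ALIGN(a->nla_len));
      }
      
      /* Print map id and value size for each SK_DIAG_BPF_STORAGE entry
       * inside an INET_DIAG_BPF_SK_STORAGES nest from a reply. */
      static void print_sk_storages(const struct nlattr *nest)
      {
              int rem = nest->nla_len - NLA_HDRLEN;
              const struct nlattr *st = nla_body(nest);
      
              while (rem >= NLA_HDRLEN &&
                     st->nla_len >= NLA_HDRLEN && st->nla_len <= rem) {
                      if ((st->nla_type & NLA_TYPE_MASK) == SK_DIAG_BPF_STORAGE) {
                              int in_rem = st->nla_len - NLA_HDRLEN;
                              const struct nlattr *in = nla_body(st);
                              unsigned int map_id = 0, value_len = 0;
      
                              while (in_rem >= NLA_HDRLEN &&
                                     in->nla_len >= NLA_HDRLEN &&
                                     in->nla_len <= in_rem) {
                                      if ((in->nla_type & NLA_TYPE_MASK) ==
                                          SK_DIAG_BPF_STORAGE_MAP_ID)
                                              memcpy(&map_id, nla_body(in),
                                                     sizeof(map_id));
                                      else if ((in->nla_type & NLA_TYPE_MASK) ==
                                               SK_DIAG_BPF_STORAGE_MAP_VALUE)
                                              value_len = in->nla_len - NLA_HDRLEN;
                                      in = nla_advance(in, &in_rem);
                              }
                              printf("bpf_map_id:%u value:%u bytes\n",
                                     map_id, value_len);
                      }
                      st = nla_advance(st, &rem);
              }
      }
      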
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
      1ed4d924
    • inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data · 0df6d328
      Martin KaFai Lau authored
      The INET_DIAG_REQ_BYTECODE nlattr is currently re-found every time
      "dump()" is restarted.
      
      In a later patch, it will also need to parse the new
      INET_DIAG_REQ_SK_BPF_STORAGES nlattr to learn the map_fds. Thus, this
      patch takes the chance to store the parsed nlattrs in cb->data
      during the "start" phase of a dump.
      
      By doing this, the "bc" argument becomes unnecessary
      and is removed.  Also, the two copies of the INET_DIAG_REQ_BYTECODE
      parsing/audit logic in the compat and current versions can be
      consolidated into one.
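      
      The resulting pattern, as a hedged sketch (the struct and parser names
      here are hypothetical placeholders, not the patch's actual symbols):
      
      #include <linux/err.h>
      #include <linux/netlink.h>
      
      /* Hypothetical per-dump state stashed in cb->data at .start() time. */
      struct example_dump_state {
              const struct nlattr *bc;             /* INET_DIAG_REQ_BYTECODE */
              const struct nlattr *bpf_stgs_req;   /* INET_DIAG_REQ_SK_BPF_STORAGES */
      };
      
      /* hypothetical: parses and audits the request nlattrs once,
       * returning the state or an ERR_PTR() on malformed input */
      void *example_parse_req(const struct nlmsghdr *nlh);
      
      static int example_dump_start(struct netlink_callback *cb)
      {
              /* Parse once here instead of re-finding the nlattrs on
               * every dump() restart. */
              cb->data = example_parse_req(cb->nlh);
              return IS_ERR(cb->data) ? PTR_ERR(cb->data) : 0;
      }
      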
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230415.1975555-1-kafai@fb.com
      0df6d328
    • inet_diag: Refactor inet_sk_diag_fill(), dump(), and dump_one() · 5682d393
      Martin KaFai Lau authored
      In a later patch, there is a need to update "cb->min_dump_alloc"
      in inet_sk_diag_fill() as it learns the different bpf_sk_storages
      stored in a sk while dumping all sk(s) (e.g. tcp_hashinfo).
      
      inet_sk_diag_fill() currently does not take the "cb" as an argument.
      One of the reasons is that inet_sk_diag_fill() is used by both dump_one()
      and dump() (which belong to "struct inet_diag_handler"), and the
      dump_one() interface does not pass the "cb" along.
      
      This patch makes dump_one() pass a "cb".  The "cb" is created in
      inet_diag_cmd_exact().  The "nlh" and "in_skb" are stored in the "cb" as
      the dump() interface does.  The total number of args to
      inet_sk_diag_fill() is also cut from 10 to 7, which lets
      many callers pass fewer args.
      
      In particular,
      "struct user_namespace *user_ns", "u32 pid", and "u32 seq"
      can be replaced by accessing "cb->nlh" and "cb->skb".
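      
      For illustration, those three can be recovered from the "cb" with
      standard accessors (a hedged sketch, not a quote of the patch):
      
      #include <linux/netlink.h>
      #include <net/sock.h>
      
      /* What inet_sk_diag_fill() can now derive internally from "cb". */
      static void example_from_cb(const struct netlink_callback *cb)
      {
              struct user_namespace *user_ns = sk_user_ns(NETLINK_CB(cb->skb).sk);
              u32 portid = NETLINK_CB(cb->skb).portid;  /* replaces "u32 pid" */
              u32 seq = cb->nlh->nlmsg_seq;             /* replaces "u32 seq" */
      
              (void)user_ns; (void)portid; (void)seq;
      }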
      
      A similar argument reduction is also made to
      inet_twsk_diag_fill() and inet_req_diag_fill().
      
      inet_csk_diag_dump() and inet_csk_diag_fill() are also removed.
      They are mostly equivalent to inet_sk_diag_fill() and have very few
      remaining users, so inet_sk_diag_fill() is used directly in those
      places.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230409.1975173-1-kafai@fb.com
      5682d393
    • net/mlx5e: Eswitch, Use per vport tables for mirroring · 96e32687
      Eli Cohen authored
      When using port mirroring, we forward the traffic to another table and
      use that table to forward to the mirrored vport. Since the hardware
      loses the values of reg c, and in particular reg c0, we fail the match
      on the input vport which previously existed in reg c0. To overcome this
      situation, we use a set of per vport tables, positioned at the lowest
      priority, and forward traffic to those tables. Since these tables are
      per vport, we can avoid matching on reg c0.
      
      Fixes: c01cfd0f ("net/mlx5: E-Switch, Add match on vport metadata for rule in fast path")
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      96e32687
    • bpf: Replace zero-length array with flexible-array member · d7f10df8
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] into the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
      d7f10df8
    • net: phylink: propagate resolved link config via mac_link_up() · 91a208f2
      Russell King authored
      Propagate the resolved link parameters via the mac_link_up() call for
      MACs that do not automatically track their PCS state. We propagate the
      link parameters via function arguments so that inappropriate members
      of struct phylink_link_state can't be accessed; creating a new
      structure just for this would add needless complexity to the API.
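      
      The callback therefore takes the resolved parameters directly; a hedged
      sketch of its shape after this change (see include/linux/phylink.h for
      the authoritative prototype):
      
      #include <linux/phy.h>
      #include <linux/phylink.h>
      
      /* Sketch: resolved speed/duplex/pause arrive as plain arguments,
       * so a MAC driver never reads them from phylink_link_state here. */
      struct example_mac_ops {
              void (*mac_link_up)(struct phylink_config *config,
                                  struct phy_device *phy, unsigned int mode,
                                  phy_interface_t interface, int speed,
                                  int duplex, bool tx_pause, bool rx_pause);
      };
      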
      Tested-by: Andre Przywara <andre.przywara@arm.com>
      Tested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91a208f2
  6. 27 February 2020, 5 commits
  7. 26 February 2020, 1 commit
  8. 25 February 2020, 6 commits
  9. 22 February 2020, 4 commits
    • netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports · f66ee041
      Jozsef Kadlecsik authored
      In the case of huge hash:* types of sets, due to the single spinlock of
      a set the processing of the whole set under spinlock protection could take
      too long.
      
      There were four places where the whole hash table of the set was
      processed from bucket to bucket while holding the spinlock:
      
      - During a resize, the original set was locked to exclude kernel-side
        add/del element operations (userspace add/del is excluded by the
        nfnetlink mutex). The original set is actually only read during the
        resize, so the spinlocking is replaced with rcu locking of regions
        (see the sketch after this list). This, however, allows parallel
        kernel-side add/del of entries; in order not to lose those operations,
        a backlog is added and replayed after a successful resize.
      - Garbage collection of timed-out entries was also protected by the
        spinlock. To avoid holding the lock too long, region locking is
        introduced and a single region is processed in one gc pass. Also, the
        simple timer-based gc is replaced with a workqueue-based solution. The
        internal book-keeping (number of elements, size of extensions) is
        moved to region level due to the region locking.
      - Adding elements: when the max number of elements is reached, the gc
        was called to evict the timed-out entries. The new approach is that
        the gc is called just for the matching region, assuming that if the
        region (proportionally) seems to be full, then the whole set probably
        is too. We could scan the other regions to check every entry under rcu
        locking, but for huge sets it'd mean a slowdown when adding elements.
      - Listing the set header data: when the set was defined with timeout
        support, the garbage collector was called to clean up timed-out
        entries so as to report correct element counts and set size values.
        Now the set is scanned to count the non-timed-out entries, without
        actually calling the gc for the whole set.
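      
      A simplified, hedged sketch of the region idea referenced above (the
      names here are illustrative, not ipset's actual symbols):
      
      #include <linux/spinlock.h>
      #include <linux/types.h>
      
      struct example_region {
              spinlock_t lock;   /* protects only this region's buckets */
              u32 elements;      /* per-region element count */
              size_t ext_size;   /* per-region extension accounting */
      };
      
      struct example_hash_set {
              u32 nbuckets;      /* total hash buckets */
              u32 nregions;      /* buckets are grouped into regions */
              struct example_region *regions;
      };
      
      /* gc, resize and "is this region full?" checks lock one region
       * at a time instead of the whole set. */
      static struct example_region *
      bucket_region(const struct example_hash_set *set, u32 bucket)
      {
              return &set->regions[bucket / (set->nbuckets / set->nregions)];
      }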
      
      Thanks to Florian Westphal for helping me solve the SOFTIRQ-safe ->
      SOFTIRQ-unsafe lock order issues while working on the patch.
      
      Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com
      Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com
      Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com
      Fixes: 23c42a40 ("netfilter: ipset: Introduction of new commands and protocol version 7")
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
      f66ee041
    • net, sk_msg: Annotate lockless access to sk_prot on clone · b8e202d1
      Jakub Sitnicki authored
      The sk_msg and ULP frameworks override the protocol callbacks pointer
      in sk->sk_prot, while TCP accesses it locklessly when cloning a
      listening socket, that is, with neither sk_lock nor sk_callback_lock
      held.
      
      Once we enable the use of listening sockets with sockmap (and hence
      sk_msg), there will be shared access to sk->sk_prot if a socket is
      being cloned while it is inserted into or deleted from the sockmap on
      another CPU:
      
      Read side:
      
      tcp_v4_rcv
        sk = __inet_lookup_skb(...)
        tcp_check_req(sk)
          inet_csk(sk)->icsk_af_ops->syn_recv_sock
            tcp_v4_syn_recv_sock
              tcp_create_openreq_child
                inet_csk_clone_lock
                  sk_clone_lock
                    READ_ONCE(sk->sk_prot)
      
      Write side:
      
      sock_map_ops->map_update_elem
        sock_map_update_elem
          sock_map_update_common
            sock_map_link_no_progs
              tcp_bpf_init
                tcp_bpf_update_sk_prot
                  sk_psock_update_proto
                    WRITE_ONCE(sk->sk_prot, ops)
      
      sock_map_ops->map_delete_elem
        sock_map_delete_elem
          __sock_map_delete
           sock_map_unref
             sk_psock_put
               sk_psock_drop
                 sk_psock_restore_proto
                   tcp_update_ulp
                     WRITE_ONCE(sk->sk_prot, proto)
      
      Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
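      
      The annotation pattern itself, as a minimal hedged sketch (not the
      sockmap code):
      
      #include <linux/compiler.h>
      #include <net/sock.h>
      
      /* Both sides of the lockless race are marked, so the compiler
       * cannot tear or refetch the pointer and KCSAN knows the race is
       * intentional. */
      static void example_update_proto(struct sock *sk, struct proto *ops)
      {
              WRITE_ONCE(sk->sk_prot, ops);   /* sockmap update/delete path */
      }
      
      static struct proto *example_read_proto(const struct sock *sk)
      {
              return READ_ONCE(sk->sk_prot);  /* clone path, no locks held */
      }
      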
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
      b8e202d1
    • y2038: remove unused time32 interfaces · 412c53a6
      Arnd Bergmann authored
      No users remain, so kill these off before we grow new ones.
      
      Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      412c53a6
    • y2038: remove ktime to/from timespec/timeval conversion · 595abbaf
      Arnd Bergmann authored
      A couple of helpers are now obsolete and can be removed, so drivers can
      no longer start using them and must use y2038-safe interfaces instead.
      
      Link: http://lkml.kernel.org/r/20200110154232.4104492-2-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      595abbaf
  10. 21 February 2020, 4 commits
  11. 19 February 2020, 4 commits
  12. 17 February 2020, 2 commits
    • KVM: x86: fix missing prototypes · d970a325
      Paolo Bonzini authored
      Reported with "make W=1" due to -Wmissing-prototypes.
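      
      The class of fix, as a hypothetical example (not the actual functions
      touched here):
      
      /* With -Wmissing-prototypes, a non-static definition that has no
       * prior declaration warns under "make W=1": */
      int kvm_example_op(int v);      /* fix: declare it in a shared header */
      
      int kvm_example_op(int v)       /* ...or make it static if file-local */
      {
              return v;
      }
      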
      Reported-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d970a325
    • netfilter: conntrack: allow insertion of clashing entries · 6a757c07
      Florian Westphal authored
      This patch further relaxes the need to drop an skb due to a clash with
      an existing conntrack entry.
      
      Current clash resolution handles the case where the clash occurs between
      two identical entries (distinct nf_conn objects with same tuples), i.e.:
      
                          Original                        Reply
      existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
      clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
      
      ... existing handling discards the unconfirmed clashing entry and
      makes skb->_nfct point to the existing one.  The skb can then be
      processed normally, just as if the clash had never existed in the
      first place.
      
      For other clashes, the skb needs to be dropped.
      This frequently happens with DNS resolvers that send A and AAAA queries
      back-to-back when NAT rules are present that cause packets to get
      different DNAT transformations applied, for example:
      
      -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
      -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353
      
      In this case the A or AAAA query is dropped, which incurs a costly
      delay during name resolution.
      
      This patch also allows this collision type:
                             Original                   Reply
      existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
      clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353
      
      In this case, the clash is in the original direction -- the reply
      direction is still unique.
      
      The change makes it so that when the second colliding packet is
      received, the clashing conntrack is tagged with the new
      IPS_NAT_CLASH_BIT, gets a fixed 1 second timeout and is inserted in
      the reply direction only.
      
      The entry is hidden from 'conntrack -L', it will time out quickly
      and it can be early dropped because it will never progress to the
      ASSURED state.
      
      To avoid special-casing the ORIGINAL hlist_nulls node in the delete
      code path, a new helper, "hlist_nulls_add_fake", is added so that
      hlist_nulls_del() will work.
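      
      The helper is likely shaped like its non-nulls counterpart,
      hlist_add_fake() (a hedged sketch following include/linux/list_nulls.h
      conventions):
      
      #include <linux/list_nulls.h>
      
      /* Make the node look list-resident so hlist_nulls_del() works:
       * pprev points back at the node itself, next is a nulls marker. */
      static inline void example_hlist_nulls_add_fake(struct hlist_nulls_node *n)
      {
              n->pprev = &n->next;
              n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
      }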
      
      Example:
      
            CPU A:                               CPU B:
      1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
      2.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
      3.  Apply DNAT, reply changed to 10.0.0.6
      4.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
      5.                                         Apply DNAT, reply changed to 10.0.0.7
      6. confirm/commit to conntrack table, no collisions
      7.                                         commit clashing entry
      
      Reply comes in:
      
      10.2.3.4:42 <- 10.0.0.6:5353 (A)
       -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
      10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
       -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
          The conntrack entry is deleted from table, as it has the NAT_CLASH
          bit set.
      
      In case of a retransmit from the ORIGINAL direction, all further
      packets will get the DNAT transformation to 10.0.0.6.
      
      I tried to come up with other solutions but they all have worse
      problems.
      
      Alternatives considered were:
      1.  Confirm ct entries at allocation time, not in postrouting.
       a. will cause unnecessary work when the skb that creates the
          conntrack is dropped by the ruleset.
       b. in case nat is applied, ct entry would need to be moved in
          the table, which requires another spinlock pair to be taken.
       c. breaks the 'unconfirmed entry is private to cpu' assumption:
          we would need to guard all nfct->ext allocation requests with
          ct->lock spinlock.
      
      2. Make the unconfirmed list a hash table instead of a pcpu list.
         Shares drawback c) of the first alternative.
      
      3. Document that this is expected and force users to rearrange their
         ruleset (e.g. by using "-m cluster" instead of "-m statistics").
         nft has the 'jhash' expression which can be used instead of 'numgen'.
      
         Major drawback: this doesn't fix what I consider a bug, it is not
         very realistic, and I believe it's reasonable to expect existing
         rulesets to 'just work'.
      
      4. Document that this is expected and force users to steer problematic
         packets to the same CPU -- this would serialize the "allocate new
         conntrack entry / nat table evaluation / perform nat / confirm
         entry" sequence, so no race can occur.  Similar drawback to 3.
      
      Another advantage of this patch compared to 1) and 2) is that there are
      no changes to the hot path; things are handled in the udp tracker and
      the clash resolution path.
      
      Cc: rcu@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6a757c07