1. 04 5月, 2019 2 次提交
  2. 02 5月, 2019 12 次提交
    • A
      ipvs: fix warning on unused variable · ae5e0c77
      Andrea Claudi 提交于
      [ Upstream commit c93a49b9769e435990c82297aa0baa31e1538790 ]
      
      When CONFIG_IP_VS_IPV6 is not defined, build produced this warning:
      
      net/netfilter/ipvs/ip_vs_ctl.c:899:6: warning: unused variable ‘ret’ [-Wunused-variable]
        int ret = 0;
            ^~~
      
      Fix this by moving the declaration of 'ret' in the CONFIG_IP_VS_IPV6
      section in the same function.
      
      While at it, drop its unneeded initialisation.
      
      Fixes: 098e13f5b21d ("ipvs: fix dependency on nf_defrag_ipv6")
      Reported-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NAndrea Claudi <aclaudi@redhat.com>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ae5e0c77
    • P
      netfilter: nf_tables: bogus EBUSY in helper removal from transaction · ffc1d85e
      Pablo Neira Ayuso 提交于
      [ Upstream commit 8ffcd32f64633926163cdd07a7d295c500a947d1 ]
      
      Proper use counter updates when activating and deactivating the object,
      otherwise, this hits bogus EBUSY error.
      
      Fixes: cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
      Reported-by: NLaura Garcia <nevola@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ffc1d85e
    • P
      netfilter: nf_tables: bogus EBUSY when deleting set after flush · e313d5da
      Pablo Neira Ayuso 提交于
      [ Upstream commit 273fe3f1006ea5ebc63d6729e43e8e45e32b256a ]
      
      Set deletion after flush coming in the same batch results in EBUSY. Add
      set use counter to track the number of references to this set from
      rules. We cannot rely on the list of bindings for this since such list
      is still populated from the preparation phase.
      Reported-by: NVáclav Zindulka <vaclav.zindulka@tlapnet.cz>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e313d5da
    • P
      netfilter: nf_tables: fix set double-free in abort path · 25ddad73
      Pablo Neira Ayuso 提交于
      [ Upstream commit 40ba1d9b4d19796afc9b7ece872f5f3e8f5e2c13 ]
      
      The abort path can cause a double-free of an anonymous set.
      Added-and-to-be-aborted rule looks like this:
      
      udp dport { 137, 138 } drop
      
      The to-be-aborted transaction list looks like this:
      
      newset
      newsetelem
      newsetelem
      rule
      
      This gets walked in reverse order, so first pass disables the rule, the
      set elements, then the set.
      
      After synchronize_rcu(), we then destroy those in same order: rule, set
      element, set element, newset.
      
      Problem is that the anonymous set has already been bound to the rule, so
      the rule (lookup expression destructor) already frees the set, when then
      cause use-after-free when trying to delete the elements from this set,
      then try to free the set again when handling the newset expression.
      
      Rule releases the bound set in first place from the abort path, this
      causes the use-after-free on set element removal when undoing the new
      element transactions. To handle this, skip new element transaction if
      set is bound from the abort path.
      
      This is still causes the use-after-free on set element removal.  To
      handle this, remove transaction from the list when the set is already
      bound.
      
      Joint work with Florian Westphal.
      
      Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path")
      Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      25ddad73
    • P
      netfilter: nft_compat: use .release_ops and remove list of extension · 8906234c
      Pablo Neira Ayuso 提交于
      [ Upstream commit b8e204006340b7aaf32bd2b9806c692f6e0cb38a ]
      
      Add .release_ops, that is called in case of error at a later stage in
      the expression initialization path, ie. .select_ops() has been already
      set up operations and that needs to be undone. This allows us to unwind
      .select_ops from the error path, ie. release the dynamic operations for
      this extension.
      
      Moreover, allocate one single operation instead of recycling them, this
      comes at the cost of consuming a bit more memory per rule, but it
      simplifies the infrastructure.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8906234c
    • F
      netfilter: nft_compat: don't use refcount_inc on newly allocated entry · 4f67e897
      Florian Westphal 提交于
      [ Upstream commit 947e492c0fc2132ae5fca081a9c2952ccaab0404 ]
      
      When I moved the refcount to refcount_t type I missed the fact that
      refcount_inc() will result in use-after-free warning with
      CONFIG_REFCOUNT_FULL=y builds.
      
      The correct fix would be to init the reference count to 1 at allocation
      time, but, unfortunately we cannot do this, as we can't undo that
      in case something else fails later in the batch.
      
      So only solution I see is to special-case the 'new entry' condition
      and replace refcount_inc() with a "delayed" refcount_set(1) in this case,
      as done here.
      
      The .activate callback can be removed to simplify things, we only
      need to make sure that deactivate() decrements/unlinks the entry
      from the list at end of transaction phase (commit or abort).
      
      Fixes: 12c44aba6618 ("netfilter: nft_compat: use refcnt_t type for nft_xt reference count")
      Reported-by: NJordan Glover <Golden_Miller83@protonmail.ch>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4f67e897
    • P
      netfilter: nf_tables: unbind set in rule from commit path · af26f3e2
      Pablo Neira Ayuso 提交于
      Anonymous sets that are bound to rules from the same transaction trigger
      a kernel splat from the abort path due to double set list removal and
      double free.
      
      This patch updates the logic to search for the transaction that is
      responsible for creating the set and disable the set list removal and
      release, given the rule is now responsible for this. Lookup is reverse
      since the transaction that adds the set is likely to be at the tail of
      the list.
      
      Moreover, this patch adds the unbind step to deliver the event from the
      commit path.  This should not be done from the worker thread, since we
      have no guarantees of in-order delivery to the listener.
      
      This patch removes the assumption that both activate and deactivate
      callbacks need to be provided.
      
      Fixes: cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
      Reported-by: NMikhail Morfikov <mmorfikov@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      af26f3e2
    • F
      netfilter: nf_tables: warn when expr implements only one of activate/deactivate · 27458b54
      Florian Westphal 提交于
      ->destroy is only allowed to free data, or do other cleanups that do not
      have side effects on other state, such as visibility to other netlink
      requests.
      
      Such things need to be done in ->deactivate.
      As a transaction can fail, we need to make sure we can undo such
      operations, therefore ->activate() has to be provided too.
      
      So print a warning and refuse registration if expr->ops provides
      only one of the two operations.
      
      v2: fix nft_expr_check_ops to not repeat same check twice (Jones Desougi)
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      27458b54
    • F
      netfilter: nft_compat: destroy function must not have side effects · cb2e343d
      Florian Westphal 提交于
      The nft_compat destroy function deletes the nft_xt object from a list.
      This isn't allowed anymore. Destroy functions are called asynchronously,
      i.e. next batch can find the object that has a pending ->destroy()
      invocation:
      
      cpu0                       cpu1
       worker
         ->destroy               for_each_entry()
      	                     if (x == ...
      			        return x->ops;
           list_del(x)
           kfree_rcu(x)
                                 expr->ops->... // ops was free'd
      
      To resolve this, the list_del needs to occur before the transaction
      mutex gets released.  nf_tables has a 'deactivate' hook for this
      purpose, so use that to unlink the object from the list.
      
      Fixes: 0935d5588400 ("netfilter: nf_tables: asynchronous release")
      Reported-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      cb2e343d
    • F
      netfilter: nf_tables: split set destruction in deactivate and destroy phase · 3dbba8eb
      Florian Westphal 提交于
      [ Upstream commit cd5125d8f51882279f50506bb9c7e5e89dc9bef3 ]
      
      Splits unbind_set into destroy_set and unbinding operation.
      
      Unbinding removes set from lists (so new transaction would not
      find it anymore) but keeps memory allocated (so packet path continues
      to work).
      
      Rebind function is added to allow unrolling in case transaction
      that wants to remove set is aborted.
      
      Destroy function is added to free the memory, but this could occur
      outside of transaction in the future.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3dbba8eb
    • F
      netfilter: nft_compat: make lists per netns · 7693bae6
      Florian Westphal 提交于
      [ Upstream commit cf52572ebbd7189a1966c2b5fc34b97078cd1dce ]
      
      There are two problems with nft_compat since the netlink config
      plane uses a per-netns mutex:
      
      1. Concurrent add/del accesses to the same list
      2. accesses to a list element after it has been free'd already.
      
      This patch fixes the first problem.
      
      Freeing occurs from a work queue, after transaction mutexes have been
      released, i.e., it still possible for a new transaction (even from
      same net ns) to find the to-be-deleted expression in the list.
      
      The ->destroy functions are not allowed to have any such side effects,
      i.e. the list_del() in the destroy function is not allowed.
      
      This part of the problem is solved in the next patch.
      I tried to make this work by serializing list access via mutex
      and by moving list_del() to a deactivate callback, but
      Taehee spotted following race on this approach:
      
        NET #0                          NET #1
         >select_ops()
         ->init()
                                         ->select_ops()
         ->deactivate()
         ->destroy()
            nft_xt_put()
             kfree_rcu(xt, rcu_head);
                                         ->init() <-- use-after-free occurred.
      
      Unfortunately, we can't increment reference count in
      select_ops(), because we can't undo the refcount increase in
      case a different expression fails in the same batch.
      
      (The destroy hook will only be called in case the expression
       was initialized successfully).
      
      Fixes: f102d66b ("netfilter: nf_tables: use dedicated mutex to guard transactions")
      Reported-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7693bae6
    • F
      netfilter: nft_compat: use refcnt_t type for nft_xt reference count · db99f122
      Florian Westphal 提交于
      [ Upstream commit 12c44aba6618b7f6c437076e5722237190f6cd5f ]
      
      Using standard integer type was fine while all operations on it were
      guarded by the nftnl subsys mutex.
      
      This isn't true anymore:
      1. transactions are guarded only by a pernet mutex, so concurrent
         rule manipulation in different netns is racy
      2. the ->destroy hook runs from a work queue after the transaction
         mutex has been released already.
      
      cpu0                           cpu1 (net 1)        cpu2 (net 2)
       kworker
          nft_compat->destroy        nft_compat->init    nft_compat->init
            if (--nft_xt->ref == 0)   nft_xt->ref++        nft_xt->ref++
      
      Switch to refcount_t.  Doing this however only fixes a minor aspect,
      nft_compat also performs linked-list operations in an unsafe way.
      
      This is addressed in the next two patches.
      
      Fixes: f102d66b ("netfilter: nf_tables: use dedicated mutex to guard transactions")
      Fixes: 0935d5588400 ("netfilter: nf_tables: asynchronous release")
      Reported-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      db99f122
  3. 20 4月, 2019 2 次提交
    • T
      netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine · 263ed7e6
      Taehee Yoo 提交于
      [ Upstream commit b7f1a16d29b2e28d3dcbb070511bd703e306281b ]
      
      When device is unregistered, flowtable flush routine is called
      by notifier_call(nf_tables_flowtable_event). and exit callback of
      nftables pernet_operation(nf_tables_exit_net) also has flowtable flush
      routine. but when network namespace is destroyed, both notifier_call
      and pernet_operation are called. hence flowtable flush routine in
      pernet_operation is unnecessary.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 nft add table ip filter
         %ip netns exec vm1 nft add flowtable ip filter w \
      	{ hook ingress priority 0\; devices = { lo }\; }
         %ip netns del vm1
      
      splat looks like:
      [  265.187019] WARNING: CPU: 0 PID: 87 at net/netfilter/core.c:309 nf_hook_entry_head+0xc7/0xf0
      [  265.187112] Modules linked in: nf_flow_table_ipv4 nf_flow_table nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip_tables x_tables
      [  265.187390] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc3+ #5
      [  265.187453] Workqueue: netns cleanup_net
      [  265.187514] RIP: 0010:nf_hook_entry_head+0xc7/0xf0
      [  265.187546] Code: 8d 81 68 03 00 00 5b c3 89 d0 83 fa 04 48 8d 84 c7 e8 11 00 00 76 81 0f 0b 31 c0 e9 78 ff ff ff 0f 0b 48 83 c4 08 31 c0 5b c3 <0f> 0b 31 c0 e9 65 ff ff ff 0f 0b 31 c0 e9 5c ff ff ff 48 89 0c 24
      [  265.187573] RSP: 0018:ffff88011546f098 EFLAGS: 00010246
      [  265.187624] RAX: ffffffff8d90e135 RBX: 1ffff10022a8de1c RCX: 0000000000000000
      [  265.187645] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880116298040
      [  265.187645] RBP: ffff88010ea4c1a8 R08: 0000000000000000 R09: 0000000000000000
      [  265.187645] R10: ffff88011546f1d8 R11: ffffed0022c532c1 R12: ffff88010ea4c1d0
      [  265.187645] R13: 0000000000000005 R14: dffffc0000000000 R15: ffff88010ea4c1c4
      [  265.187645] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
      [  265.187645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  265.187645] CR2: 00007fdfb8d00000 CR3: 0000000057a16000 CR4: 00000000001006f0
      [  265.187645] Call Trace:
      [  265.187645]  __nf_unregister_net_hook+0xca/0x5d0
      [  265.187645]  ? nf_hook_entries_free.part.3+0x80/0x80
      [  265.187645]  ? save_trace+0x300/0x300
      [  265.187645]  nf_unregister_net_hooks+0x2e/0x40
      [  265.187645]  nf_tables_exit_net+0x479/0x1340 [nf_tables]
      [  265.187645]  ? find_held_lock+0x39/0x1c0
      [  265.187645]  ? nf_tables_abort+0x30/0x30 [nf_tables]
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? trace_hardirqs_on+0x93/0x210
      [  265.187645]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? __mutex_unlock_slowpath+0x17f/0x740
      [  265.187645]  ? wait_for_completion+0x710/0x710
      [  265.187645]  ? bucket_table_free+0xb2/0x1f0
      [  265.187645]  ? nested_table_free+0x130/0x130
      [  265.187645]  ? __lock_is_held+0xb4/0x140
      [  265.187645]  ops_exit_list.isra.10+0x94/0x140
      [  265.187645]  cleanup_net+0x45b/0x900
      [ ... ]
      
      This WARNING means that hook unregisteration is failed because
      all flowtables hooks are already unregistered by notifier_call.
      
      Network namespace exit routine guarantees that all devices will be
      unregistered first. then, other exit callbacks of pernet_operations
      are called. so that removing flowtable flush routine in exit callback of
      pernet_operation(nf_tables_exit_net) doesn't make flowtable leak.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      263ed7e6
    • P
      netfilter: xt_cgroup: shrink size of v2 path · 1f2b1c6a
      Pablo Neira Ayuso 提交于
      [ Upstream commit 0d704967f4a49cc2212350b3e4a8231f8b4283ed ]
      
      cgroup v2 path field is PATH_MAX which is too large, this is placing too
      much pressure on memory allocation for people with many rules doing
      cgroup v1 classid matching, side effects of this are bug reports like:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=200639
      
      This patch registers a new revision that shrinks the cgroup path to 512
      bytes, which is the same approach we follow in similar extensions that
      have a path field.
      
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1f2b1c6a
  4. 17 4月, 2019 2 次提交
  5. 06 4月, 2019 4 次提交
    • F
      netfilter: physdev: relax br_netfilter dependency · 2e6bcc32
      Florian Westphal 提交于
      [ Upstream commit 8e2f311a68494a6677c1724bdcb10bada21af37c ]
      
      Following command:
        iptables -D FORWARD -m physdev ...
      causes connectivity loss in some setups.
      
      Reason is that iptables userspace will probe kernel for the module revision
      of the physdev patch, and physdev has an artificial dependency on
      br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
      is loaded).
      
      This causes the "phydev" module to be loaded, which in turn enables the
      "call-iptables" infrastructure.
      
      bridged packets might then get dropped by the iptables ruleset.
      
      The better fix would be to change the "call-iptables" defaults to 0 and
      enforce explicit setting to 1, but that breaks backwards compatibility.
      
      This does the next best thing: add a request_module call to checkentry.
      This was a stray '-D ... -m physdev' won't activate br_netfilter
      anymore.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2e6bcc32
    • C
      netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm · 3a1ce979
      Chieh-Min Wang 提交于
      [ Upstream commit 13f5251fd17088170c18844534682d9cab5ff5aa ]
      
      For bridge(br_flood) or broadcast/multicast packets, they could clone
      skb with unconfirmed conntrack which break the rule that unconfirmed
      skb->_nfct is never shared.  With nfqueue running on my system, the race
      can be easily reproduced with following warning calltrace:
      
      [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
      [13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
      [13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
      [13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
      [13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
      [13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
      [13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
      [13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
      [13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
      [13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
      [13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
      [13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
      [13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
      [13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
      [13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
      [13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
      [13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
      [13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
      [13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
      [13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
      [13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
      [13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
      [13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
      [13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
      [13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)
      
      The original code just triggered the warning but do nothing. It will
      caused the shared conntrack moves to the dying list and the packet be
      droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).
      
      - Reproduce steps:
      
      +----------------------------+
      |          br0(bridge)       |
      |                            |
      +-+---------+---------+------+
        | eth0|   | eth1|   | eth2|
        |     |   |     |   |     |
        +--+--+   +--+--+   +---+-+
           |         |          |
           |         |          |
        +--+-+     +-+--+    +--+-+
        | PC1|     | PC2|    | PC3|
        +----+     +----+    +----+
      
      iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass
      
      ps: Our nfq userspace program will set mark on packets whose connection
      has already been processed.
      
      PC1 sends broadcast packets simulated by hping3:
      
      hping3 --rand-source --udp 192.168.1.255 -i u100
      
      - Broadcast racing flow chart is as follow:
      
      br_handle_frame
        BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
        // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
        br_handle_frame_finish
          // check if this packet is broadcast
          br_flood_forward
            br_flood
              list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
                maybe_deliver
                  deliver_clone
                    skb = skb_clone(skb)
                    __br_forward
                      BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                      // queue in our nfq and received by our userspace program
                      // goto __nf_conntrack_confirm with process context on CPU 1
          br_pass_frame_up
            BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
            // goto __nf_conntrack_confirm with softirq context on CPU 0
      
      Because conntrack confirm can happen at both INPUT and POSTROUTING
      stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
      conntrack could race on different core.
      
      This patch fixes a repeating kernel splat, now it is only displayed
      once.
      Signed-off-by: NChieh-Min Wang <chiehminw@synology.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3a1ce979
    • F
      netfilter: conntrack: tcp: only close if RST matches exact sequence · ca66f667
      Florian Westphal 提交于
      [ Upstream commit be0502a3f2e94211a8809a09ecbc3a017189b8fb ]
      
      TCP resets cause instant transition from established to closed state
      provided the reset is in-window.  Endpoints that implement RFC 5961
      require resets to match the next expected sequence number.
      RST segments that are in-window (but that do not match RCV.NXT) are
      ignored, and a "challenge ACK" is sent back.
      
      Main problem for conntrack is that its a middlebox, i.e.  whereas an end
      host might have ACK'd SEQ (and would thus accept an RST with this
      sequence number), conntrack might not have seen this ACK (yet).
      
      Therefore we can't simply flag RSTs with non-exact match as invalid.
      
      This updates RST processing as follows:
      
      1. If the connection is in a state other than ESTABLISHED, nothing is
         changed, RST is subject to normal in-window check.
      
      2. If the RSTs sequence number either matches exactly RCV.NXT,
         connection state moves to CLOSE.
      
      3. The same applies if the RST sequence number aligns with a previous
         packet in the same direction.
      
      In all other cases, the connection remains in ESTABLISHED state.
      If the normal-in-window check passes, the timeout will be lowered
      to that of CLOSE.
      
      If the peer sends a challenge ack, connection timeout will be reset.
      
      If the challenge ACK triggers another RST (RST was valid after all),
      this 2nd RST will match expected sequence and conntrack state changes to
      CLOSE.
      
      If no challenge ACK is received, the connection will time out after
      CLOSE seconds (10 seconds by default), just like without this patch.
      
      Packetdrill test case:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      // Receive a segment.
      0.210 < P. 1:1001(1000) ack 1 win 46
      0.210 > . 1:1(0) ack 1001
      
      // Application writes 1000 bytes.
      0.250 write(4, ..., 1000) = 1000
      0.250 > P. 1:1001(1000) ack 1001
      
      // First reset, old sequence. Conntrack (correctly) considers this
      // invalid due to failed window validation (regardless of this patch).
      0.260 < R  2:2(0) ack 1001 win 260
      
      // 2nd reset, but too far ahead sequence.  Same: correctly handled
      // as invalid.
      0.270 < R 99990001:99990001(0) ack 1001 win 260
      
      // in-window, but not exact sequence.
      // Current Linux kernels might reply with a challenge ack, and do not
      // remove connection.
      // Without this patch, conntrack state moves to CLOSE.
      // With patch, timeout is lowered like CLOSE, but connection stays
      // in ESTABLISHED state.
      0.280 < R 1010:1010(0) ack 1001 win 260
      
      // Expect challenge ACK
      0.281 > . 1001:1001(0) ack 1001 win 501
      
      // With or without this patch, RST will cause connection
      // to move to CLOSE (sequence number matches)
      // 0.282 < R 1001:1001(0) ack 1001 win 260
      
      // ACK
      0.300 < . 1001:1001(0) ack 1001 win 257
      
      // more data could be exchanged here, connection
      // is still established
      
      // Client closes the connection.
      0.610 < F. 1001:1001(0) ack 1001 win 260
      0.650 > . 1001:1001(0) ack 1002
      
      // Close the connection without reading outstanding data
      0.700 close(4) = 0
      
      // so one more reset.  Will be deemed acceptable with patch as well:
      // connection is already closing.
      0.701 > R. 1001:1001(0) ack 1002 win 501
      // End packetdrill test case.
      
      With patch, this generates following conntrack events:
         [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
      [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
      [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      
      Without patch, first RST moves connection to close, whereas socket state
      does not change until FIN is received.
         [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
      [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
      [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
      [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
      
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ca66f667
    • L
      netfilter: nf_tables: check the result of dereferencing base_chain->stats · 709aaa09
      Li RongQing 提交于
      [ Upstream commit a9f5e78c403d2d62ade4f4c85040efc85f4049b8 ]
      
      Check the result of dereferencing base_chain->stats, instead of result
      of this_cpu_ptr with NULL.
      
      base_chain->stats maybe be changed to NULL when a chain is updated and a
      new NULL counter can be attached.
      
      And we do not need to check returning of this_cpu_ptr since
      base_chain->stats is from percpu allocator if it is non-NULL,
      this_cpu_ptr returns a valid value.
      
      And fix two sparse error by replacing rcu_access_pointer and
      rcu_dereference with READ_ONCE under rcu_read_lock.
      
      Thanks for Eric's help to finish this patch.
      
      Fixes: 00924094 ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NZhang Yu <zhangyu31@baidu.com>
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      709aaa09
  6. 24 3月, 2019 2 次提交
    • A
      ipvs: fix dependency on nf_defrag_ipv6 · 5ca2ef67
      Andrea Claudi 提交于
      [ Upstream commit 098e13f5b21d3398065fce8780f07a3ef62f4812 ]
      
      ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
      but lacks proper Kconfig dependencies and does not explicitly
      request defrag features.
      
      As a result, if netfilter hooks are not loaded, when IPv6 fragmented
      packet are handled by ipvs only the first fragment makes through.
      
      Fix it properly declaring the dependency on Kconfig and registering
      netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NAndrea Claudi <aclaudi@redhat.com>
      Acked-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5ca2ef67
    • F
      netfilter: compat: initialize all fields in xt_init · e0e6b0d7
      Francesco Ruggeri 提交于
      [ Upstream commit 8d29d16d21342a0c86405d46de0c4ac5daf1760f ]
      
      If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
      time, the following panic can be caused by running
      
      % ebtables -t broute -F BROUTING
      
      from a 32-bit user level on a 64-bit kernel. This patch replaces
      kmalloc_array with kcalloc when allocating xt.
      
      [  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
      [  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
      [  474.693838] Oops: 0000 [#1] SMP
      [  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
      [  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
      [  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
      [  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
      [  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
      [  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
      [  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
      [  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
      [  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
      [  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
      [  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
      [  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      [  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
      [  474.802052] Call Trace:
      [  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
      [  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
      [  474.815605]  ? try_module_get+0x37/0x42
      [  474.819716]  compat_nf_setsockopt+0x4f/0x6d
      [  474.824172]  compat_ip_setsockopt+0x7e/0x8c
      [  474.828641]  compat_raw_setsockopt+0x16/0x3a
      [  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
      [  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
      [  474.843343]  ? __check_object_size+0x76/0x19a
      [  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
      [  474.853276]  do_fast_syscall_32+0xaf/0xf6
      [  474.857548]  entry_SYSENTER_compat+0x6b/0x7a
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e0e6b0d7
  7. 14 3月, 2019 4 次提交
    • T
      netfilter: xt_TEE: add missing code to get interface index in checkentry. · e6e00017
      Taehee Yoo 提交于
      [ Upstream commit 18c0ab87364ac5128a152055fdcb1d27e01caf01 ]
      
      checkentry(tee_tg_check) should initialize priv->oif from dev if possible.
      But only netdevice notifier handler can set that.
      Hence priv->oif is always -1 until notifier handler is called.
      
      Fixes: 9e2f6c5d ("netfilter: Rework xt_TEE netdevice notifier")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e6e00017
    • T
      netfilter: xt_TEE: fix wrong interface selection · 02d86085
      Taehee Yoo 提交于
      [ Upstream commit f24d2d4f9586985509320f90308723d3d0c4e47f ]
      
      TEE netdevice notifier handler checks only interface name. however
      each netns can have same interface name. hence other netns's interface
      could be selected.
      
      test commands:
         %ip netns add vm1
         %iptables -I INPUT -p icmp -j TEE --gateway 192.168.1.1 --oif enp2s0
         %ip link set enp2s0 netns vm1
      
      Above rule is in the root netns. but that rule could get enp2s0
      ifindex of vm1 by notifier handler.
      
      After this patch, TEE rule is added to the per-netns list.
      
      Fixes: 9e2f6c5d ("netfilter: Rework xt_TEE netdevice notifier")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      02d86085
    • M
      netfilter: nf_nat: skip nat clash resolution for same-origin entries · 5058447b
      Martynas Pumputis 提交于
      [ Upstream commit 4e35c1cb9460240e983a01745b5f29fe3a4d8e39 ]
      
      It is possible that two concurrent packets originating from the same
      socket of a connection-less protocol (e.g. UDP) can end up having
      different IP_CT_DIR_REPLY tuples which results in one of the packets
      being dropped.
      
      To illustrate this, consider the following simplified scenario:
      
      1. Packet A and B are sent at the same time from two different threads
         by same UDP socket.  No matching conntrack entry exists yet.
         Both packets cause allocation of a new conntrack entry.
      2. get_unique_tuple gets called for A.  No clashing entry found.
         conntrack entry for A is added to main conntrack table.
      3. get_unique_tuple is called for B and will find that the reply
         tuple of B is already taken by A.
         It will allocate a new UDP source port for B to resolve the clash.
      4. conntrack entry for B cannot be added to main conntrack table
         because its ORIGINAL direction is clashing with A and the REPLY
         directions of A and B are not the same anymore due to UDP source
         port reallocation done in step 3.
      
      This patch modifies nf_conntrack_tuple_taken so it doesn't consider
      colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.
      
      [ Florian: simplify patch to not use .allow_clash setting
        and always ignore identical flows ]
      Signed-off-by: NMartynas Pumputis <martynas@weave.works>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5058447b
    • Z
      ipvs: Fix signed integer overflow when setsockopt timeout · e0b03a6b
      ZhangXiaoxu 提交于
      [ Upstream commit 53ab60baa1ac4f20b080a22c13b77b6373922fd7 ]
      
      There is a UBSAN bug report as below:
      UBSAN: Undefined behaviour in net/netfilter/ipvs/ip_vs_ctl.c:2227:21
      signed integer overflow:
      -2147483647 * 1000 cannot be represented in type 'int'
      
      Reproduce program:
      	#include <stdio.h>
      	#include <sys/types.h>
      	#include <sys/socket.h>
      
      	#define IPPROTO_IP 0
      	#define IPPROTO_RAW 255
      
      	#define IP_VS_BASE_CTL		(64+1024+64)
      	#define IP_VS_SO_SET_TIMEOUT	(IP_VS_BASE_CTL+10)
      
      	/* The argument to IP_VS_SO_GET_TIMEOUT */
      	struct ipvs_timeout_t {
      		int tcp_timeout;
      		int tcp_fin_timeout;
      		int udp_timeout;
      	};
      
      	int main() {
      		int ret = -1;
      		int sockfd = -1;
      		struct ipvs_timeout_t to;
      
      		sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
      		if (sockfd == -1) {
      			printf("socket init error\n");
      			return -1;
      		}
      
      		to.tcp_timeout = -2147483647;
      		to.tcp_fin_timeout = -2147483647;
      		to.udp_timeout = -2147483647;
      
      		ret = setsockopt(sockfd,
      				 IPPROTO_IP,
      				 IP_VS_SO_SET_TIMEOUT,
      				 (char *)(&to),
      				 sizeof(to));
      
      		printf("setsockopt return %d\n", ret);
      		return ret;
      	}
      
      Return -EINVAL if the timeout value is negative or max than 'INT_MAX / HZ'.
      Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Acked-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e0b03a6b
  8. 27 2月, 2019 7 次提交
  9. 26 1月, 2019 1 次提交
    • S
      netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets · ad7013cd
      Stefano Brivio 提交于
      [ Upstream commit 8cc4ccf58379935f3ad456cc34e61c4e4c921d0e ]
      
      There doesn't seem to be any reason to restrict MAC address
      matching to source MAC addresses in set types bitmap:ipmac,
      hash:ipmac and hash:mac. With this patch, and this setup:
      
        ip netns add A
        ip link add veth1 type veth peer name veth2 netns A
        ip addr add 192.0.2.1/24 dev veth1
        ip -net A addr add 192.0.2.2/24 dev veth2
        ip link set veth1 up
        ip -net A link set veth2 up
      
        ip netns exec A ipset create test hash:mac
        dst=$(ip netns exec A cat /sys/class/net/veth2/address)
        ip netns exec A ipset add test ${dst}
        ip netns exec A iptables -P INPUT DROP
        ip netns exec A iptables -I INPUT -m set --match-set test dst -j ACCEPT
      
      ipset will match packets based on destination MAC address:
      
        # ping -c1 192.0.2.2 >/dev/null
        # echo $?
        0
      Reported-by: NYi Chen <yiche@redhat.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ad7013cd
  10. 23 1月, 2019 4 次提交