1. 24 3月, 2019 2 次提交
    • A
      netfilter: ipt_CLUSTERIP: fix warning unused variable cn · 0d98ecb1
      Anders Roxell 提交于
      commit 206b8cc514d7ff2b79dd2d5ad939adc7c493f07a upstream.
      
      When CONFIG_PROC_FS isn't set the variable cn isn't used.
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
      net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
        struct clusterip_net *cn = clusterip_pernet(net);
                              ^~
      
      Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS".
      
      Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine")
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d98ecb1
    • M
      esp: Skip TX bytes accounting when sending from a request socket · b92eaed3
      Martin Willi 提交于
      [ Upstream commit 09db51241118aeb06e1c8cd393b45879ce099b36 ]
      
      On ESP output, sk_wmem_alloc is incremented for the added padding if a
      socket is associated to the skb. When replying with TCP SYNACKs over
      IPsec, the associated sk is a casted request socket, only. Increasing
      sk_wmem_alloc on a request socket results in a write at an arbitrary
      struct offset. In the best case, this produces the following WARNING:
      
      WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
      refcount_t: addition on 0; use-after-free.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
      Hardware name: Marvell Armada 380/385 (Device Tree)
      [...]
      [<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
      [<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
      [<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
      [<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
      [<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
      [<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
      [<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
      [<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
      [<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
      [<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
      [<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
      [<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
      [<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
      [...]
      
      The issue triggers only when not using TCP syncookies, as for syncookies
      no socket is associated.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: NMartin Willi <martin@strongswan.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b92eaed3
  2. 19 3月, 2019 5 次提交
  3. 14 3月, 2019 1 次提交
    • S
      vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel · 8ce41db0
      Su Yanjun 提交于
      [ Upstream commit dd9ee3444014e8f28c0eefc9fffc9ac9c5248c12 ]
      
      Recently we run a network test over ipcomp virtual tunnel.We find that
      if a ipv4 packet needs fragment, then the peer can't receive
      it.
      
      We deep into the code and find that when packet need fragment the smaller
      fragment will be encapsulated by ipip not ipcomp. So when the ipip packet
      goes into xfrm, it's skb->dev is not properly set. The ipv4 reassembly code
      always set skb'dev to the last fragment's dev. After ipv4 defrag processing,
      when the kernel rp_filter parameter is set, the skb will be drop by -EXDEV
      error.
      
      This patch adds compatible support for the ipip process in ipcomp virtual tunnel.
      Signed-off-by: NSu Yanjun <suyj.fnst@cn.fujitsu.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8ce41db0
  4. 10 3月, 2019 6 次提交
  5. 27 2月, 2019 2 次提交
    • T
      netfilter: ipt_CLUSTERIP: fix sleep-in-atomic bug in clusterip_config_entry_put() · 6546e115
      Taehee Yoo 提交于
      commit 2a61d8b883bbad26b06d2e6cc3777a697e78830d upstream.
      
      A proc_remove() can sleep. so that it can't be inside of spin_lock.
      Hence proc_remove() is moved to outside of spin_lock. and it also
      adds mutex to sync create and remove of proc entry(config->pde).
      
      test commands:
      SHELL#1
         %while :; do iptables -A INPUT -p udp -i enp2s0 -d 192.168.1.100 \
      	   --dport 9000  -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:21 --total-nodes 3 --local-node 3; \
      	   iptables -F; done
      
      SHELL#2
         %while :; do echo +1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; \
      	   echo -1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; done
      
      [ 2949.569864] BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
      [ 2949.579944] in_atomic(): 1, irqs_disabled(): 0, pid: 5472, name: iptables
      [ 2949.587920] 1 lock held by iptables/5472:
      [ 2949.592711]  #0: 000000008f0ebcf2 (&(&cn->lock)->rlock){+...}, at: refcount_dec_and_lock+0x24/0x50
      [ 2949.603307] CPU: 1 PID: 5472 Comm: iptables Tainted: G        W         4.19.0-rc5+ #16
      [ 2949.604212] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [ 2949.604212] Call Trace:
      [ 2949.604212]  dump_stack+0xc9/0x16b
      [ 2949.604212]  ? show_regs_print_info+0x5/0x5
      [ 2949.604212]  ___might_sleep+0x2eb/0x420
      [ 2949.604212]  ? set_rq_offline.part.87+0x140/0x140
      [ 2949.604212]  ? _rcu_barrier_trace+0x400/0x400
      [ 2949.604212]  wait_for_completion+0x94/0x710
      [ 2949.604212]  ? wait_for_completion_interruptible+0x780/0x780
      [ 2949.604212]  ? __kernel_text_address+0xe/0x30
      [ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
      [ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
      [ 2949.604212]  ? __init_waitqueue_head+0x86/0x130
      [ 2949.604212]  ? init_wait_entry+0x1a0/0x1a0
      [ 2949.604212]  proc_entry_rundown+0x208/0x270
      [ 2949.604212]  ? proc_reg_get_unmapped_area+0x370/0x370
      [ 2949.604212]  ? __lock_acquire+0x4500/0x4500
      [ 2949.604212]  ? complete+0x18/0x70
      [ 2949.604212]  remove_proc_subtree+0x143/0x2a0
      [ 2949.708655]  ? remove_proc_entry+0x390/0x390
      [ 2949.708655]  clusterip_tg_destroy+0x27a/0x630 [ipt_CLUSTERIP]
      [ ... ]
      
      Fixes: b3e456fc ("netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6546e115
    • K
      inet_diag: fix reporting cgroup classid and fallback to priority · 589503cb
      Konstantin Khlebnikov 提交于
      [ Upstream commit 1ec17dbd90f8b638f41ee650558609c1af63dfa0 ]
      
      Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
      extensions has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them included into response
      unconditionally or hook into some of lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has not way to request from the beginning.
      
      This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents behavior for other extensions.
      
      Also this patch adds fallback to reporting socket priority. This filed
      is more widely used for traffic classification because ipv4 sockets
      automatically maps TOS to priority and default qdisc pfifo_fast knows
      about that. But priority could be changed via setsockopt SO_PRIORITY so
      INET_DIAG_TOS isn't enough for predicting class.
      
      Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
      reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for most common setup when net_cls isn't set and/or cgroup2 in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      589503cb
  6. 23 2月, 2019 4 次提交
  7. 07 2月, 2019 3 次提交
  8. 31 1月, 2019 5 次提交
  9. 26 1月, 2019 3 次提交
    • T
      netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine · 9d51378a
      Taehee Yoo 提交于
      [ Upstream commit 5a86d68bcf02f2d1e9a5897dd482079fd5f75e7f ]
      
      When network namespace is destroyed, cleanup_net() is called.
      cleanup_net() holds pernet_ops_rwsem then calls each ->exit callback.
      So that clusterip_tg_destroy() is called by cleanup_net().
      And clusterip_tg_destroy() calls unregister_netdevice_notifier().
      
      But both cleanup_net() and clusterip_tg_destroy() hold same
      lock(pernet_ops_rwsem). hence deadlock occurrs.
      
      After this patch, only 1 notifier is registered when module is inserted.
      And all of configs are added to per-net list.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.809674] ============================================
      [  341.809674] WARNING: possible recursive locking detected
      [  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
      [  341.809674] --------------------------------------------
      [  341.809674] kworker/u4:2/87 is trying to acquire lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
      [  341.809674]
      [  341.809674] but task is already holding lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] other info that might help us debug this:
      [  341.809674]  Possible unsafe locking scenario:
      [  341.809674]
      [  341.809674]        CPU0
      [  341.809674]        ----
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]
      [  341.809674]  *** DEADLOCK ***
      [  341.809674]
      [  341.809674]  May be due to missing lock nesting notation
      [  341.809674]
      [  341.809674] 3 locks held by kworker/u4:2/87:
      [  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
      [  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
      [  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] stack backtrace:
      [  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
      [  341.809674] Workqueue: netns cleanup_net
      [  341.809674] Call Trace:
      [ ... ]
      [  342.070196]  down_write+0x93/0x160
      [  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? down_read+0x1e0/0x1e0
      [  342.070196]  ? sched_clock_cpu+0x126/0x170
      [  342.070196]  ? find_held_lock+0x39/0x1c0
      [  342.070196]  unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? register_netdevice_notifier+0x790/0x790
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.070196]  ? trace_hardirqs_on+0x93/0x210
      [  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
      [  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
      [  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
      [  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
      [  342.123094]  ? kfree+0xe2/0x2a0
      [  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
      [  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
      [  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
      [  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
      [  342.123094]  ops_exit_list.isra.10+0x94/0x140
      [  342.123094]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 202f59af ("netfilter: ipt_CLUSTERIP: do not hold dev")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9d51378a
    • T
      netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine · bb7b6c49
      Taehee Yoo 提交于
      [ Upstream commit b12f7bad5ad3724d19754390a3e80928525c0769 ]
      
      When network namespace is destroyed, both clusterip_tg_destroy() and
      clusterip_net_exit() are called. and clusterip_net_exit() is called
      before clusterip_tg_destroy().
      Hence cleanup check code in clusterip_net_exit() doesn't make sense.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
      [  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
      [  341.227509] Workqueue: netns cleanup_net
      [  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
      [  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
      [  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
      [  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
      [  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
      [  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
      [  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
      [  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
      [  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
      [  341.227509] Call Trace:
      [  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
      [  341.227509]  ? default_device_exit+0x1ca/0x270
      [  341.227509]  ? remove_proc_entry+0x1cd/0x390
      [  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
      [  341.227509]  ? __init_waitqueue_head+0x130/0x130
      [  341.227509]  ops_exit_list.isra.10+0x94/0x140
      [  341.227509]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 613d0776 ("netfilter: exit_net cleanup check added")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bb7b6c49
    • T
      netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set · 744383c8
      Taehee Yoo 提交于
      [ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]
      
      If same destination IP address config is already existing, that config is
      just used. MAC address also should be same.
      However, there is no MAC address checking routine.
      So that MAC address checking routine is added.
      
      test commands:
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1
      
      After this patch, above commands are disallowed.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      744383c8
  10. 23 1月, 2019 2 次提交
  11. 10 1月, 2019 5 次提交
    • E
      tcp: fix a race in inet_diag_dump_icsk() · e5217034
      Eric Dumazet 提交于
      [ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]
      
      Alexei reported use after frees in inet_diag_dump_icsk() [1]
      
      Because we use refcount_set() when various sockets are setup and
      inserted into ehash, we also need to make sure inet_diag_dump_icsk()
      wont race with the refcount_set() operations.
      
      Jonathan Lemon sent a patch changing net_twsk_hashdance() but
      other spots would need risky changes.
      
      Instead, fix inet_diag_dump_icsk() as this bug came with
      linux-4.10 only.
      
      [1] Quoting Alexei :
      
      First something iterating over sockets finds already freed tw socket:
      
      refcount_t: increment on 0; use-after-free.
      WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
      RIP: 0010:refcount_inc+0x26/0x30
      RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
      RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
      RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
      FS:  00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag]  // sock_hold(sk); in net/ipv4/inet_diag.c:1002
       ? kmalloc_large_node+0x37/0x70
       ? __kmalloc_node_track_caller+0x1cb/0x260
       ? __alloc_skb+0x72/0x1b0
       ? __kmalloc_reserve.isra.40+0x2e/0x80
       __inet_diag_dump+0x3b/0x80 [inet_diag]
       netlink_dump+0x116/0x2a0
       netlink_recvmsg+0x205/0x3c0
       sock_read_iter+0x89/0xd0
       __vfs_read+0xf7/0x140
       vfs_read+0x8a/0x140
       SyS_read+0x3f/0xa0
       do_syscall_64+0x5a/0x100
      
      then a minute later twsk timer fires and hits two bad refcnts
      for this freed socket:
      
      refcount_t: decrement hit 0; leaking memory.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
      Modules linked in:
      RIP: 0010:refcount_dec+0x2e/0x40
      RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
      RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_kill+0x9d/0xc0  // inet_twsk_bind_unhash(tw, hashinfo);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
      RIP: 0010:refcount_sub_and_test+0x46/0x50
      RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
      RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_put+0x12/0x20  // inet_twsk_put(tw);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      Fixes: 67db3e4b ("tcp: no longer hold ehash lock while calling tcp_get_info()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5217034
    • M
      net: ipv4: do not handle duplicate fragments as overlapping · d5f9565c
      Michal Kubecek 提交于
      [ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]
      
      Since commit 7969e5c4 ("ip: discard IPv4 datagrams with overlapping
      segments.") IPv4 reassembly code drops the whole queue whenever an
      overlapping fragment is received. However, the test is written in a way
      which detects duplicate fragments as overlapping so that in environments
      with many duplicate packets, fragmented packets may be undeliverable.
      
      Add an extra test and for (potentially) duplicate fragment, only drop the
      new fragment rather than the whole queue. Only starting offset and length
      are checked, not the contents of the fragments as that would be too
      expensive. For similar reason, linear list ("run") of a rbtree node is not
      iterated, we only check if the new fragment is a subset of the interval
      covered by existing consecutive fragments.
      
      v2: instead of an exact check iterating through linear list of an rbtree
      node, only check if the new fragment is subset of the "run" (suggested
      by Eric Dumazet)
      
      Fixes: 7969e5c4 ("ip: discard IPv4 datagrams with overlapping segments.")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5f9565c
    • E
      net: clear skb->tstamp in forwarding paths · 281731c8
      Eric Dumazet 提交于
      [ Upstream commit 8203e2d844d34af247a151d8ebd68553a6e91785 ]
      
      Sergey reported that forwarding was no longer working
      if fq packet scheduler was used.
      
      This is caused by the recent switch to EDT model, since incoming
      packets might have been timestamped by __net_timestamp()
      
      __net_timestamp() uses ktime_get_real(), while fq expects packets
      using CLOCK_MONOTONIC base.
      
      The fix is to clear skb->tstamp in forwarding paths.
      
      Fixes: 80b14dee ("net: Add a new socket option for a future transmit time.")
      Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSergey Matyukevich <geomatsi@gmail.com>
      Tested-by: NSergey Matyukevich <geomatsi@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      281731c8
    • W
      ip: validate header length on virtual device xmit · cde81154
      Willem de Bruijn 提交于
      [ Upstream commit cb9f1b783850b14cbd7f87d061d784a666dfba1f ]
      
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cde81154
    • G
      ipv4: Fix potential Spectre v1 vulnerability · 360fb1db
      Gustavo A. R. Silva 提交于
      [ Upstream commit 5648451e30a0d13d11796574919a359025d52cce ]
      
      vr.vifi is indirectly controlled by user-space, hence leading to
      a potential exploitation of the Spectre variant 1 vulnerability.
      
      This issue was detected with the help of Smatch:
      
      net/ipv4/ipmr.c:1616 ipmr_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
      net/ipv4/ipmr.c:1690 ipmr_compat_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
      
      Fix this by sanitizing vr.vifi before using it to index mrt->vif_table'
      
      Notice that given that speculation windows are large, the policy is
      to kill the speculation on the first load and not worry if it can be
      completed with a dependent load/store [1].
      
      [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      360fb1db
  12. 17 12月, 2018 2 次提交
    • E
      tcp: lack of available data can also cause TSO defer · d265655a
      Eric Dumazet 提交于
      commit f9bfe4e6a9d08d405fe7b081ee9a13e649c97ecf upstream.
      
      tcp_tso_should_defer() can return true in three different cases :
      
       1) We are cwnd-limited
       2) We are rwnd-limited
       3) We are application limited.
      
      Neal pointed out that my recent fix went too far, since
      it assumed that if we were not in 1) case, we must be rwnd-limited
      
      Fix this by properly populating the is_cwnd_limited and
      is_rwnd_limited booleans.
      
      After this change, we can finally move the silly check for FIN
      flag only for the application-limited case.
      
      The same move for EOR bit will be handled in net-next,
      since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
      with eor mark (MSG_EOR)") is scheduled for linux-4.21
      
      Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
      and checking none of them was rwnd_limited in the chrono_stat
      output from "ss -ti" command.
      
      Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Suggested-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d265655a
    • T
      netfilter: nat: fix double register in masquerade modules · 5517d4c6
      Taehee Yoo 提交于
      [ Upstream commit 095faf45e64be00bff4da2d6182dface3d69c9b7 ]
      
      There is a reference counter to ensure that masquerade modules register
      notifiers only once. However, the existing reference counter approach is
      not safe, test commands are:
      
         while :
         do
         	   modprobe ip6t_MASQUERADE &
      	   modprobe nft_masq_ipv6 &
      	   modprobe -rv ip6t_MASQUERADE &
      	   modprobe -rv nft_masq_ipv6 &
         done
      
      numbers below represent the reference counter.
      --------------------------------------------------------
      CPU0        CPU1        CPU2        CPU3        CPU4
      [insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
      --------------------------------------------------------
      0->1
      register    1->2
                  returns     2->1
      			returns     1->0
                                                      0->1
                                                      register <--
                                          unregister
      --------------------------------------------------------
      
      The unregistation of CPU3 should be processed before the
      registration of CPU4.
      
      In order to fix this, use a mutex instead of reference counter.
      
      splat looks like:
      [  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
      [  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
      [  323.869574] irq event stamp: 194074
      [  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
      [  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
      [  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
      [  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
      [  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
      [  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
      [  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
      [  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
      [  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
      [  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
      [  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
      [  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
      [  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      [  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
      [  323.898930] Call Trace:
      [  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
      [  323.898930]  ? down_read+0x150/0x150
      [  323.898930]  ? sched_clock_cpu+0x126/0x170
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  register_netdevice_notifier+0xbb/0x790
      [  323.898930]  ? __dev_close_many+0x2d0/0x2d0
      [  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
      [  323.898930]  ? wait_for_completion+0x710/0x710
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? up_write+0x6c/0x210
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
      [  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
      [ ... ]
      
      Fixes: 8dd33cc9 ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
      Fixes: be6b635c ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5517d4c6