1. 19 Sep, 2017 1 commit
  2. 17 Sep, 2017 1 commit
  3. 16 Sep, 2017 4 commits
    • sctp: do not mark sk dumped when inet_sctp_diag_fill returns err · 8c7c19a5
      Committed by Xin Long
      sctp_diag would not actually dump out sk/asoc if inet_sctp_diag_fill
      returns err, in which case it shouldn't mark sk dumped by setting
      cb->args[3] as 1 in sctp_sock_dump().
      
      Otherwise, it could cause some asocs to have no parent's sk dumped
      in 'ss --sctp'.
      
      So this patch is to not set cb->args[3] when inet_sctp_diag_fill()
      returns err in sctp_sock_dump().
      
      Fixes: 8f840e47 ("sctp: add the sctp_diag.c file")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
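The rule this commit enforces can be sketched in userspace C. This is only an illustration, not the kernel code: `struct dump_state`, `sock_dump()` and the two fill callbacks are hypothetical names, and `sk_dumped` merely stands in for `cb->args[3]`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the netlink callback state (cb->args[3]). */
struct dump_state {
	int sk_dumped;
};

/* Hypothetical fill routine: returns 0 on success, negative on error. */
typedef int (*fill_fn)(void *sk);

static int fill_ok(void *sk)  { (void)sk; return 0; }
static int fill_err(void *sk) { (void)sk; return -1; }

/* Mark the socket as dumped only when the fill step succeeded. */
static int sock_dump(struct dump_state *st, void *sk, fill_fn fill)
{
	int err;

	assert(st != NULL);
	err = fill(sk);
	if (err < 0)
		return err;	/* leave sk_dumped untouched so a retry can dump it */

	st->sk_dumped = 1;
	return 0;
}
```

With this ordering, a failed fill leaves the flag clear, so a later pass still dumps the parent socket.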
    • sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb
      Committed by Xin Long
      Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
      dump") tried to fix a use-after-free issue by checking !sctp_sk(sk)->ep
      while holding the sock and the sock lock.
      
      But Paolo noticed that the endpoint could be destroyed in sctp_rcv
      without sock lock protection. It means the use-after-free issue could
      still be triggered when sctp_rcv puts and destroys ep after
      sctp_sock_dump checks !ep, although it's pretty hard to reproduce.

      I could reproduce it by adding an mdelay in sctp_rcv and long msleep
      calls in sctp_close and sctp_sock_dump.

      This patch adds another param, cb_done, to sctp_for_each_transport
      and dumps ep->assocs while holding tsp after leaving the transport
      traversal, which avoids this issue.

      It also makes the sctp diag dump run faster, as there is no longer
      any need to save sk into cb->args[5] and keep calling
      sctp_for_each_transport.

      This patch also uses int * instead of int for the pos argument of
      sctp_for_each_transport, so that the position is incremented only in
      sctp_for_each_transport and cb->args[2] no longer has to be updated
      in sctp_sock_filter and sctp_sock_dump.
      
      Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: update skb->skb_mstamp more carefully · 8c72c65b
      Committed by Eric Dumazet
      liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
      in tcp_probe_timer() :
            https://www.spinics.net/lists/netdev/msg454496.html
      
      After investigation, the root cause of the problem is that we update
      skb->skb_mstamp of skbs in the write queue even if the attempt to send
      a clone or copy of it failed, for example because of a routing problem.

      This patch prevents this, solving liujian's issue.
      
      It also removes a potential RTT miscalculation, since
      __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
      TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
      been changed.
      
      A future ACK would then lead to a very small RTT sample and min_rtt
      would then be lowered to this too small value.
      
      Tested:
      
      # cat user_timeout.pkt
      --local_ip=192.168.102.64
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`
      
         +0 < S 0:0(0) win 0 <mss 1460>
         +0 > S. 0:0(0) ack 1 <mss 1460>
      
        +.1 < . 1:1(0) ack 1 win 65530
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
         +0 write(4, ..., 24) = 24
         +0 > P. 1:25(24) ack 1 win 29200
         +.1 < . 1:1(0) ack 25 win 65530
      
      //change the ipaddress
         +1 `ifconfig tun0 192.168.0.10/16`
      
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
      
         +0 `ifconfig tun0 192.168.102.64/16`
         +0 < . 1:2(1) ack 25 win 65530
         +0 `ifconfig tun0 192.168.0.10/16`
      
         +3 write(4, ..., 24) = -1
      
      # ./packetdrill user_timeout.pkt
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: liujian <liujian56@huawei.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
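The idea of committing the timestamp only after a successful send can be sketched as follows. This is a simplified userspace illustration under assumptions: `struct pkt`, `transmit()` and the `xmit_fn` hooks are hypothetical and do not reflect the real `tcp_transmit_skb()` code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an skb in the write queue. */
struct pkt {
	uint64_t mstamp;	/* time of the last successful transmit */
};

/* Hypothetical transmit hook: returns 0 on success, nonzero on failure. */
typedef int (*xmit_fn)(const struct pkt *clone);

static int xmit_ok(const struct pkt *clone)   { (void)clone; return 0; }
static int xmit_fail(const struct pkt *clone) { (void)clone; return -1; }

/* Stamp a copy first and commit the timestamp to the original only on success. */
static int transmit(struct pkt *orig, uint64_t now, xmit_fn xmit)
{
	struct pkt clone;
	int err;

	assert(orig != 0);
	clone = *orig;
	clone.mstamp = now;
	err = xmit(&clone);
	if (err)
		return err;	/* original timestamp is left unchanged */

	orig->mstamp = now;
	return 0;
}
```

Because a failed send leaves the queued packet's timestamp alone, a later ACK cannot be matched against a send that never happened, which is what would otherwise produce a too-small RTT sample.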
    • net: ipv4: fix l3slave check for index returned in IP_PKTINFO · cbea8f02
      Committed by David Ahern
      rt_iif is only set to the actual egress device for the output path. The
      recent change to consider the l3slave flag when returning IP_PKTINFO
      works for local traffic (the correct device index is returned), but it
      broke the more typical use case of packets received from a remote host
      always returning the VRF index rather than the original ingress device.
      Update the fixup to consider l3slave and rt_iif actually getting set.
      
      Fixes: 1dfa7639 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 15 Sep, 2017 1 commit
  5. 14 Sep, 2017 2 commits
    • net_sched: gen_estimator: fix scaling error in bytes/packets samples · ca558e18
      Committed by Eric Dumazet
      Denys reported wrong rate estimations with HTB classes.
      
      It appears the bug was added in linux-4.10, since my tests
      were using intervals of one second only.

      Because HTB uses 4-second rate estimators by default, the
      reported rates were 4x higher.
      
      We need to properly scale the bytes/packets samples before
      integrating them in EWMA.
      
      Tested:
       echo 1 >/sys/module/sch_htb/parameters/htb_rate_est
      
       Set up HTB with one class with a rate/ceil of 5Gbit
      
       Generate traffic on this class
      
       tc -s -d cl sh dev eth0 classid 7002:11
      class htb 7002:11 parent 7002:1 prio 5 quantum 200000 rate 5Gbit ceil
      5Gbit linklayer ethernet burst 80000b/1 mpu 0b cburst 80000b/1 mpu 0b
      level 0 rate_handle 1
       Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0
      requeues 0)
       rate 5Gbit 412814pps backlog 136260b 2p requeues 0
       TCP pkts/rtx 982969327/45 bytes 1488215557414/68130
       lended: 22732826 borrowed: 0 giants: 0
       tokens: -1684 ctokens: -1684
      
      Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
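The scaling step described in this commit can be illustrated with a small integer EWMA. This is a sketch under assumptions, not the kernel's fixed-point code: `struct rate_est` and `est_update()` are hypothetical names, and the interval is given in whole seconds for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical estimator state; names are illustrative, not the kernel's. */
struct rate_est {
	uint64_t avg_bps;	/* smoothed rate in bytes per second */
	unsigned int ewma_log;	/* smoothing weight is 1 / 2^ewma_log */
};

/* Fold one sample into the average; bytes were counted over interval_sec seconds. */
static void est_update(struct rate_est *e, uint64_t bytes, unsigned int interval_sec)
{
	uint64_t sample;

	assert(interval_sec > 0);
	sample = bytes / interval_sec;	/* scale to per-second before integrating */

	if (sample >= e->avg_bps)
		e->avg_bps += (sample - e->avg_bps) >> e->ewma_log;
	else
		e->avg_bps -= (e->avg_bps - sample) >> e->ewma_log;
}
```

Without the division by the interval, a 4-second estimator would feed 4 seconds' worth of bytes into the average as if it were a per-second sample, which matches the 4x overestimate reported above.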
    • net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker · 255cd50f
      Committed by Jiri Pirko
      Recent commit d7fb60b9 ("net_sched: get rid of tcfa_rcu") removed
      freeing in call_rcu, which turned an existing hard-to-hit race
      condition into one that is hit every time:
      
      [  598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [  598.607782] IP: tcf_action_destroy+0xc0/0x140
      
      Or:
      
      [   40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [   40.862840] IP: tcf_generic_walker+0x534/0x820
      
      Fix this by storing the ops and using them directly for the module_put call.
      
      Fixes: a85a970a ("net_sched: move tc_action into tcf_common")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
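The "store the ops first" pattern can be shown with a short userspace sketch. The types and functions here (`struct ops`, `struct action`, `action_release()`, `action_destroy()`) are hypothetical, and the counter decrement only stands in for `module_put()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for tc action ops and a module reference count. */
struct ops {
	int module_refs;
};

struct action {
	struct ops *ops;
};

/* Release path that may free the action; returns 1 when it was freed. */
static int action_release(struct action *a)
{
	free(a);
	return 1;
}

/* Save the ops pointer before release so nothing reads the freed action. */
static void action_destroy(struct action *a)
{
	struct ops *ops;

	assert(a != NULL);
	ops = a->ops;		/* read while the action is still valid */
	if (action_release(a))
		ops->module_refs--;	/* uses the saved pointer, not a->ops */
}
```

Reading `a->ops` after `action_release()` would touch freed memory, so the pointer is captured while the action is still alive.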
  6. 13 Sep, 2017 7 commits
    • ip6_tunnel: fix ip6 tunnel lookup in collect_md mode · 6c1cb439
      Committed by Haishuang Yan
      In collect_md mode, if the tun dev is down, __ip6_tnl_rcv can still
      be called to receive packets, and the rx statistics increase
      improperly.

      When the md tunnel is down, it's not necessary to increase RX drops
      for the tunnel device: packets would be received on the fallback tunnel,
      and the RX drops on the fallback device will be increased as expected.
      
      Fixes: 8d79266b ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: fix ip tunnel lookup in collect_md mode · 833a8b40
      Committed by Haishuang Yan
      In collect_md mode, if the tun dev is down, ip_tunnel_rcv can still
      be called to receive packets, and the rx statistics increase
      improperly.

      When the md tunnel is down, it's not necessary to increase RX drops
      for the tunnel device: packets would be received on the fallback tunnel,
      and the RX drops on the fallback device will be increased as expected.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: carefully handle tcf_block_put() · 1697c4bb
      Committed by Cong Wang
      As pointed out by Jiri, there is still a race condition between
      tcf_block_put() and tcf_chain_destroy() in a RCU callback. There
      is no way to make it correct without proper locking or synchronization,
      because both operate on a shared list.
      
      Locking is hard because the only lock we can pick here is a spinlock;
      however, in tc_dump_tfilter() we iterate this list while calling a
      function that may sleep (tcf_chain_dump()), which makes using a lock
      to protect chain_list almost impossible.
      
      Jiri suggested the idea of holding a refcnt before flushing, this works
      because it guarantees us there would be no parallel tcf_chain_destroy()
      during the loop, therefore the race condition is gone. But we have to
      be very careful with proper synchronization with RCU callbacks.
      Suggested-by: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix reference counting of tc filter chain · e2ef7544
      Committed by Cong Wang
      This patch fixes the following ugliness of tc filter chain refcnt:
      
      a) tp proto should hold a refcnt to the chain too. This significantly
         simplifies the logic.
      
      b) Chain 0 is no longer special, it is created with refcnt=1 like any
         other chains. All the ugliness in tcf_chain_put() can be gone!
      
      c) No need to handle the flushing oddly, because block still holds
         chain 0, it can not be released, this guarantees block is the last
         user.
      
      d) The race condition with RCU callbacks is easier to handle with just
         a rcu_barrier(). Much easier to understand, nothing to hide. Thanks
         to the previous patch. Please see also the comments in code.
      
      e) Make the code understandable by humans, much less error-prone.
      
      Fixes: 744a4cf6 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: get rid of tcfa_rcu · d7fb60b9
      Committed by Cong Wang
      gen estimator has been rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      so the caller no longer needs to wait for a grace period.
      This patch therefore gets rid of it.
      
      This also completely closes a race condition between action free
      path and filter chain add/remove path for the following patch.
      Because otherwise the nested RCU callback can't be caught by
      rcu_barrier().
      
      Please see also the comments in code.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: remove reqsk_put() from inet_child_forget() · da8ab578
      Committed by Eric Dumazet
      Back in linux-4.4, I inadvertently put a call to reqsk_put() in
      inet_child_forget(), forgetting it could be called from two different
      points.
      
      In the case it is called from inet_csk_reqsk_queue_add(), we want to
      keep the reference on the request socket, since it is released later by
      the caller (tcp_v{4|6}_rcv())
      
      This bug never showed up because atomic_dec_and_test() was not signaling
      the underflow, and SLAB_DESTROY_BY_RCU semantics for request sockets
      prevented the request from being put in quarantine.
      
      Recent conversion of socket refcount from atomic_t to refcount_t finally
      exposed the bug.
      
      So move the reqsk_put() to inet_csk_listen_stop() to fix this.
      
      Thanks to Shankara Pailoor for using syzkaller and providing
      a nice set of .config and C repro.
      
      WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186
      refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0xf7/0x1aa lib/dump_stack.c:52
       panic+0x1ae/0x3a7 kernel/panic.c:180
       __warn+0x1c4/0x1d9 kernel/panic.c:541
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
       do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
       do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846
      RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      RSP: 0018:ffff88006e006b60 EFLAGS: 00010286
      RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60
      RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d
      R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       reqsk_put+0x71/0x2b0 include/net/request_sock.h:123
       tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:477 [inline]
       ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336
       process_backlog+0x1c5/0x6d0 net/core/dev.c:5102
       napi_poll net/core/dev.c:5499 [inline]
       net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565
       __do_softirq+0x2cb/0xb2d kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898
       </IRQ>
       do_softirq.part.16+0x63/0x80 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline]
       ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231
       ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:237 [inline]
       ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:471 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123
       tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575
       tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545
       tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline]
       tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930
       tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483
       sk_backlog_rcv include/net/sock.h:907 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2223
       release_sock+0xa4/0x2a0 net/core/sock.c:2715
       inet_wait_for_connect net/ipv4/af_inet.c:557 [inline]
       __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643
       inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682
       SYSC_connect+0x204/0x470 net/socket.c:1628
       SyS_connect+0x24/0x30 net/socket.c:1609
       entry_SYSCALL_64_fastpath+0x18/0xad
      RIP: 0033:0x451e59
      RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59
      RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007
      RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
      R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000
      
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Shankara Pailoor <sp3485@columbia.edu>
      Tested-by: Shankara Pailoor <sp3485@columbia.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
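Why the extra put stayed hidden until the refcount_t conversion can be illustrated with a checked counter. This is a userspace sketch only: `struct ref` and `ref_put()` are hypothetical and far simpler than the kernel's `refcount_t`, which warns and saturates rather than setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical checked counter; a userspace illustration, not refcount_t itself. */
struct ref {
	unsigned int count;
	bool underflow;		/* set when a put is attempted at zero */
};

/* Drop one reference; returns true only when the count reaches zero. */
static bool ref_put(struct ref *r)
{
	assert(r != 0);
	if (r->count == 0) {
		r->underflow = true;	/* an unchecked counter would wrap silently here */
		return false;
	}
	return --r->count == 0;
}
```

A plain decrement-and-test wraps past zero without any report, whereas a checked counter turns the second, unbalanced put into a visible event, much like the warning in the trace above.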
    • openvswitch: Fix an error handling path in 'ovs_nla_init_match_and_action()' · 5829e62a
      Committed by Christophe JAILLET
      All other error handling paths in this function go through the 'error'
      label. This one should do the same.
      
      Fixes: 9cc9a5cb ("datapath: Avoid using stack larger than 1024.")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 12 Sep, 2017 3 commits
  8. 10 Sep, 2017 1 commit
  9. 09 Sep, 2017 16 commits
    • bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed
      Committed by Daniel Borkmann
      Differentiate between an illegal XDP action code and one that is
      merely unsupported by the driver, to provide better feedback when
      we throw a one-time warning here. The reason is that with 814abfab
      ("xdp: add bpf_redirect helper function") not all drivers
      support the new XDP return code yet and thus they will
      fall into their 'default' case when checking for return
      codes after program return, which then triggers a
      bpf_warn_invalid_xdp_action() stating that the return
      code is illegal, but from XDP perspective it's not.
      
      I decided not to place something like a XDP_ACT_MAX define
      into uapi i) given we don't have this either for all other
      program types, ii) future action codes could have further
      encoding there, which would render such define unsuitable
      and we wouldn't be able to rip it out again, and iii) we
      rarely add new action codes.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
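The distinction this commit draws can be sketched as a small classifier. The enum values follow the XDP action codes that existed at the time (up to `XDP_REDIRECT`); `xdp_warn_kind()` is a hypothetical helper name used only for illustration, and the returned strings are placeholders for whatever the warning text says.

```c
#include <assert.h>
#include <string.h>

/* XDP return codes known at the time of this commit. */
enum xdp_action {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

/* Classify a code that a driver's switch statement did not handle. */
static const char *xdp_warn_kind(unsigned int act)
{
	const unsigned int act_max = XDP_REDIRECT;	/* highest known code */

	return act > act_max ? "Illegal" : "Driver unsupported";
}
```

Keeping the upper bound local to the function, rather than exporting a maximum through uapi, is consistent with the reasoning given in the commit message.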
    • net: rcu lock and preempt disable missing around generic xdp · bbbe211c
      Committed by John Fastabend
      do_xdp_generic must be called inside rcu critical section with preempt
      disabled to ensure BPF programs are valid and per-cpu variables used
      for redirect operations are consistent. This patch ensures this is true
      and fixes the splat below.
      
      The netif_receive_skb_internal() code path is now broken into two rcu
      critical sections. I decided it was better to limit the preempt_enable/disable
      block to just the xdp static key portion and the fallout is more
      rcu_read_lock/unlock calls. Seems like the best option to me.
      
      [  607.596901] =============================
      [  607.596906] WARNING: suspicious RCU usage
      [  607.596912] 4.13.0-rc4+ #570 Not tainted
      [  607.596917] -----------------------------
      [  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
      [  607.596927]
      [  607.596927] other info that might help us debug this:
      [  607.596927]
      [  607.596933]
      [  607.596933] rcu_scheduler_active = 2, debug_locks = 1
      [  607.596938] 2 locks held by pool/14624:
      [  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
      [  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
      [  607.597000]
      [  607.597000] stack backtrace:
      [  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
      [  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
      [  607.597016] Call Trace:
      [  607.597027]  dump_stack+0x67/0x92
      [  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
      [  607.597054]  do_xdp_generic+0x313/0xa50
      [  607.597068]  ? time_hardirqs_on+0x5b/0x150
      [  607.597076]  ? mark_held_locks+0x6b/0xc0
      [  607.597088]  ? netdev_pick_tx+0x150/0x150
      [  607.597117]  netif_rx_internal+0x205/0x3f0
      [  607.597127]  ? do_xdp_generic+0xa50/0xa50
      [  607.597144]  ? lock_downgrade+0x2b0/0x2b0
      [  607.597158]  ? __lock_is_held+0x93/0x100
      [  607.597187]  netif_rx+0x119/0x190
      [  607.597202]  loopback_xmit+0xfd/0x1b0
      [  607.597214]  dev_hard_start_xmit+0x127/0x4e0
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Fixes: b5cdae32 ("net: Generic XDP")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: don't select potentially stale ri->map from buggy xdp progs · 109980b8
      Committed by Daniel Borkmann
      We can potentially run into a couple of issues with the XDP
      bpf_redirect_map() helper. The ri->map in the per CPU storage
      can become stale in several ways, mostly due to misuse, where
      we can then trigger a use after free on the map:
      
      i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT
      and running on a driver not supporting XDP_REDIRECT yet. The
      ri->map on that CPU becomes stale when the XDP program is unloaded
      on the driver, and a prog B loaded on a different driver which
      supports XDP_REDIRECT return code. prog B would have to omit
      calling to bpf_redirect_map() and just return XDP_REDIRECT, which
      would then access the freed map in xdp_do_redirect() since not
      cleared for that CPU.
      
      ii) prog A is calling bpf_redirect_map(), returning a code other
      than XDP_REDIRECT. prog A is then detached, which triggers release
      of the map. prog B is attached which, similarly as in i), would
      just return XDP_REDIRECT without having called bpf_redirect_map()
      and thus be accessing the freed map in xdp_do_redirect() since
      not cleared for that CPU.
      
      iii) prog A is attached to generic XDP, calling the bpf_redirect_map()
      helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is
      currently not handling ri->map (will be fixed by Jesper), so it's
      not being reset. Later loading a e.g. native prog B which would,
      say, call bpf_xdp_redirect() and then returns XDP_REDIRECT would
      find in xdp_do_redirect() that a map was set and uses that causing
      use after free on map access.
      
      Fix thus needs to avoid accessing stale ri->map pointers, naive
      way would be to call a BPF function from drivers that just resets
      it to NULL for all XDP return codes but XDP_REDIRECT and including
      XDP_REDIRECT for drivers not supporting it yet (and let ri->map
      being handled in xdp_do_generic_redirect()). There is a less
      intrusive way w/o letting drivers call a reset for each BPF run.
      
      The verifier knows we're calling into bpf_xdp_redirect_map()
      helper, so it can do a small insn rewrite transparent to the prog
      itself in the sense that it fills R4 with a pointer to the own
      bpf_prog. We have that pointer at verification time anyway and
      R4 is allowed to be used as per calling convention we scratch
      R0 to R5 anyway, so they become inaccessible and program cannot
      read them prior to a write. Then, the helper would store the prog
      pointer in the current CPUs struct redirect_info. Later in
      xdp_do_*_redirect() we check whether the redirect_info's prog
      pointer is the same as passed xdp_prog pointer, and if that's
      the case then all good, since the prog holds a ref on the map
      anyway, so it is always valid at that point in time and must
      have a reference count of at least 1. If in the unlikely case
      they are not equal, it means we got a stale pointer, so we clear
      and bail out right there. Also do reset map and the owning prog
      in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
      bpf_xdp_redirect() won't get mixed up, only the last call should
      take precedence. A tc bpf_redirect() doesn't use map anywhere
      yet, so no need to clear it there since never accessed in that
      layer.
      
      Note that in case the prog is released, and thus the map as
      well we're still under RCU read critical section at that time
      and have preemption disabled as well. Once we commit with the
      __dev_map_insert_ctx() from xdp_do_redirect_map() and set the
      map to ri->map_to_flush, we still wait for a xdp_do_flush_map()
      to finish in devmap dismantle time once flush_needed bit is set,
      so that is fine.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
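The ownership check described in this commit can be sketched in userspace. This is a simplified illustration under assumptions: `struct redirect_info` here has only two fields, and `redirect_map_set()` / `redirect_map_get()` are hypothetical names, not the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-CPU redirect state. */
struct redirect_info {
	const void *map;	/* map chosen by the last redirect-map helper call */
	const void *map_owner;	/* program that made that call */
};

/* Helper side: remember the map together with the calling program. */
static void redirect_map_set(struct redirect_info *ri, const void *prog, const void *map)
{
	ri->map = map;
	ri->map_owner = prog;
}

/* Redirect side: returns the map if it belongs to prog, else clears it and returns NULL. */
static const void *redirect_map_get(struct redirect_info *ri, const void *prog)
{
	const void *map = ri->map;

	assert(prog != NULL);
	ri->map = NULL;
	if (map && ri->map_owner != prog)
		map = NULL;	/* stale entry left behind by another program */
	ri->map_owner = NULL;
	return map;
}
```

Since a running program holds a reference on its own map, a map whose recorded owner equals the current program is safe to use; any other entry is treated as stale and dropped.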
    • ip6_tunnel: fix setting hop_limit value for ipv6 tunnel · 18e1173d
      Committed by Haishuang Yan
      Similar to vxlan/geneve tunnels, if hop_limit is zero, it should fall
      back to ip6_dst_hoplimit().
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
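The fallback rule is small enough to state as code. This is a sketch with a hypothetical helper, `effective_hop_limit()`; `route_default` stands in for whatever the route lookup would return.

```c
#include <assert.h>
#include <stdint.h>

/* A configured hop limit of 0 means "unset"; fall back to the route's value. */
static uint8_t effective_hop_limit(uint8_t configured, uint8_t route_default)
{
	return configured ? configured : route_default;
}
```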
    • ip_tunnel: fix setting ttl and tos value in collect_md mode · 0f693f19
      Committed by Haishuang Yan
      ttl and tos variables are declared and assigned, but are not used in
      iptunnel_xmit() function.
      
      Fixes: cfc7381b ("ip_tunnel: add collect_md mode to IPIP tunnel")
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: fix typo in fib6_net_exit() · 32a805ba
      Committed by Eric Dumazet
      IPv6 FIB should use FIB6_TABLE_HASHSZ, not FIB_TABLE_HASHSZ.
      
      Fixes: ba1cc08d ("ipv6: fix memory leak with multiple tables during netns destruction")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix a request socket leak · 1f3b359f
      Committed by Eric Dumazet
      While the cited commit fixed a possible deadlock, it added a leak
      of the request socket, since reqsk_put() must be called if the BPF
      filter decided the ACK packet must be dropped.
      
      Fixes: d624d276 ("tcp: fix possible deadlock in TCP stack vs BPF filter")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: fix missing wake ups in some situations · 7906b00f
      Committed by Marcelo Ricardo Leitner
      Commit fb586f25 ("sctp: delay calls to sk_data_ready() as much as
      possible") minimized the number of wake ups that are triggered in case
      the association receives a packet with multiple data chunks on it and/or
      when io_events are enabled and then commit 0970f5b3 ("sctp: signal
      sk_data_ready earlier on data chunks reception") moved the wake up to as
      soon as possible. It thus relies on the state machine running later to
      clean the flag that the event was already generated.
      
      The issue is that there are two call paths that call
      sctp_ulpq_tail_event() outside of the state machine, causing the flag to
      linger and possibly omitting a needed wake up in the sequence.
      
      One of the call paths is when enabling SCTP_SENDER_DRY_EVENTS via
      setsockopt(SCTP_EVENTS), as noticed by Harald Welte. The other is when
      partial reliability triggers removal of chunks from the send queue when
      the application calls sendmsg().
      
      This commit fixes it by not setting the flag in case the socket is not
      owned by the user, as it won't be cleaned later. This works for
      user-initiated calls and also for rx path processing.
      
      Fixes: fb586f25 ("sctp: delay calls to sk_data_ready() as much as possible")
      Reported-by: Harald Welte <laforge@gnumonks.org>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: xt_hashlimit: fix build error caused by 64bit division · 90c4ae4e
      Committed by Vishwanath Pai
      64bit division causes build/link errors on 32bit architectures. It
      prints out error messages like:
      
      ERROR: "__aeabi_uldivmod" [net/netfilter/xt_hashlimit.ko] undefined!
      
      The value of avg passed through by userspace in BYTE mode cannot exceed
      U32_MAX, which means 64bit division in user2rate_bytes is unnecessary.
      To fix this I have changed the type of param 'user' to u32.
      
      Since anything greater than U32_MAX is an invalid input, we error out in
      hashlimit_mt_check_common() when this is the case.
      
      Changes in v2:
      	Making the return type u32 would cause an overflow for small
      	values of 'user' (for example 2, 3 etc). To avoid this I bumped
      	'r' up to u64 again, as well as the return type. This is OK since the
      	variable that stores the result is u64. We still avoid 64bit
      	division here since 'user' is u32.
      
      Fixes: bea74641 ("netfilter: xt_hashlimit: add rate match mode")
      Signed-off-by: Vishwanath Pai <vpai@akamai.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      90c4ae4e
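A minimal userspace sketch of the type choice described in the v2 note above. The typedefs, the `BYTE_SHIFT` value and the `user2rate_bytes_u32` variant are assumptions made for illustration and are not taken from the kernel source. Only the function name `user2rate_bytes` and the u32 parameter with u64 result come from the commit message.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

#define U32_MAX 0xffffffffu
/* Assumed shift amount, chosen only to make the overflow visible. */
#define BYTE_SHIFT 4

/*
 * 'user' is u32, so the division below is a 32-bit by 32-bit division
 * and needs no 64-bit division helper on 32-bit architectures.
 * The quotient is widened to u64 before the shift so that small
 * divisors do not overflow.
 */
static u64 user2rate_bytes(u32 user)
{
	u64 r;

	r = user ? U32_MAX / user : U32_MAX;
	return (r - 1) << BYTE_SHIFT;
}

/*
 * Hypothetical variant with a u32 result. For small 'user' the shifted
 * value no longer fits in 32 bits and silently wraps.
 */
static u32 user2rate_bytes_u32(u32 user)
{
	u32 r = user ? U32_MAX / user : U32_MAX;

	return (r - 1) << BYTE_SHIFT;
}
```

With these assumed constants, `user2rate_bytes(2)` does not fit in 32 bits, which is why the narrower variant wraps for small inputs.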
    • Z
      netfilter: xt_hashlimit: alloc hashtable with right size · 05d0eae7
      Zhizhou Tian committed
      struct xt_byteslimit_htable used hlist_head, but memory allocation is
      done through sizeof(struct list_head).
      Signed-off-by: Zhizhou Tian <zhizhou.tian@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      05d0eae7
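The size mismatch in the commit above can be sketched in userspace as follows. The list structs mirror the usual kernel layouts, while `struct htable_sketch` and both helper functions are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>

/* Doubly linked list head: two pointers. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hash list node and head: the head holds a single pointer. */
struct hlist_node {
	struct hlist_node *next, **pprev;
};

struct hlist_head {
	struct hlist_node *first;
};

/* Hypothetical table whose buckets are hlist heads. */
struct htable_sketch {
	unsigned int size;
	struct hlist_head hash[];	/* one bucket per slot */
};

/* Size computed with the wrong element type. */
static size_t htable_bytes_wrong(unsigned int n)
{
	return sizeof(struct htable_sketch) + sizeof(struct list_head) * n;
}

/* Size computed with the element type the table actually stores. */
static size_t htable_bytes_right(unsigned int n)
{
	return sizeof(struct htable_sketch) + sizeof(struct hlist_head) * n;
}
```

Under these definitions the wrong expression requests one extra pointer per bucket, so the table still works but takes twice the bucket memory it needs.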
    • F
      netfilter: core: remove erroneous warn_on · 74585d4f
      Florian Westphal committed
      kernel test robot reported:
      
      WARNING: CPU: 0 PID: 1244 at net/netfilter/core.c:218 __nf_hook_entries_try_shrink+0x49/0xcd
      [..]
      
      After allowing batching in nf_unregister_net_hooks it's possible that an
      earlier call to __nf_hook_entries_try_shrink already compacted the list.
      If this happens we don't need to do anything.
      
      Fixes: d3ad2c17 ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls")
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      74585d4f
    • F
      netfilter: nat: use keyed locks · 8073e960
      Florian Westphal committed
      There is no need to serialize on a single lock; we can partition the
      table and add/delete in parallel to different slots.
      This restores one of the advantages that got lost with the rhlist
      revert.
      
      Cc: Ivan Babrou <ibobrik@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8073e960
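The idea of partitioning a table across several locks can be sketched in userspace as below. This is only an illustration of the general technique, not the nf_nat code: the table layout, the sizes `NSLOTS` and `NLOCKS`, and every function name are assumptions, and a C11 `atomic_flag` spinlock stands in for a kernel spinlock.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 64	/* hypothetical number of hash slots */
#define NLOCKS 8	/* hypothetical number of locks, fewer than slots */

struct entry {
	struct entry *next;
	uint32_t key;
};

static struct entry *table[NSLOTS];
static atomic_flag slot_locks[NLOCKS];

/* Put every lock into the unlocked state before first use. */
static void locks_init(void)
{
	for (size_t i = 0; i < NLOCKS; i++)
		atomic_flag_clear(&slot_locks[i]);
}

/* Each slot always maps to the same lock; several slots share one. */
static unsigned int slot_to_lock(unsigned int slot)
{
	return slot % NLOCKS;
}

static void slot_lock(unsigned int slot)
{
	while (atomic_flag_test_and_set(&slot_locks[slot_to_lock(slot)]))
		;	/* spin until the lock is free */
}

static void slot_unlock(unsigned int slot)
{
	atomic_flag_clear(&slot_locks[slot_to_lock(slot)]);
}

/* Insert at the head of the slot chain, holding only that slot's lock. */
static void table_add(struct entry *e, uint32_t hash)
{
	unsigned int slot = hash % NSLOTS;

	slot_lock(slot);
	e->next = table[slot];
	table[slot] = e;
	slot_unlock(slot);
}

/* Unlink 'e' from its slot chain; returns 1 if it was found. */
static int table_del(struct entry *e, uint32_t hash)
{
	unsigned int slot = hash % NSLOTS;
	struct entry **pp;
	int found = 0;

	slot_lock(slot);
	for (pp = &table[slot]; *pp; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			found = 1;
			break;
		}
	}
	slot_unlock(slot);
	return found;
}
```

Two threads that add or delete in slots mapped to different locks never contend with each other, while operations on the same slot remain serialized.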
    • F
      netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable" · e1bf1687
      Florian Westphal committed
      This reverts commit 870190a9.
      
      It was not a good idea. The custom hash table was a much better
      fit for this purpose.
      
      A fast lookup is not essential. In fact, for most cases there is no lookup
      at all because the original tuple is not taken and can be used as-is.
      What needs to be fast is insertion and deletion.
      
      rhlist removal, however, requires a rhlist walk.
      We can have thousands of entries in such a list if source port/addresses
      are reused for multiple flows. If this happens, removal requests are so
      expensive that deletions of a few thousand flows can take several
      seconds(!).
      
      The advantages that we got from rhashtable are:
      1) table auto-sizing
      2) multiple locks
      
      1) would be nice to have, but it is not essential as we have at
      most one lookup per new flow, so even a million flows in the bysource
      table are not a problem compared to current deletion cost.
      2) is easy to add to a custom hash table.
      
      I tried to add hlist_node to rhlist to speed up rhltable_remove but this
      isn't doable without changing semantics.  rhltable_remove_fast will
      check that the to-be-deleted object is part of the table and that
      requires a list walk that we want to avoid.
      
      Furthermore, using hlist_node increases size of struct rhlist_head, which
      in turn increases nf_conn size.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196821
      Reported-by: Ivan Babrou <ibobrik@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e1bf1687
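The deletion cost discussed in the commit above comes down to whether unlinking an object needs a list walk. The following userspace sketch contrasts the two cases. It is an illustration of the general intrusive-list technique under simplified assumptions, not the rhashtable or nf_nat code: `hlist_add_head` and `hlist_del` are reduced versions of the kernel helpers of the same name, and `struct snode` with `slist_del` is a hypothetical singly linked list added for comparison.

```c
#include <stddef.h>

struct hlist_node {
	struct hlist_node *next, **pprev;
};

struct hlist_head {
	struct hlist_node *first;
};

/* Insert 'n' at the front of the chain. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Unlink 'n' in constant time: pprev points at whichever pointer
 * currently refers to 'n', so no walk from the head is needed.
 */
static void hlist_del(struct hlist_node *n)
{
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* Hypothetical singly linked node, for contrast. */
struct snode {
	struct snode *next;
};

/*
 * Unlink 'n' by walking from the head. '*steps' counts the nodes
 * skipped before 'n' was found, so the cost grows with chain length.
 */
static int slist_del(struct snode **head, struct snode *n,
		     unsigned int *steps)
{
	struct snode **pp;

	*steps = 0;
	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			return 1;
		}
		(*steps)++;
	}
	return 0;
}
```

With thousands of entries chained together, the walking variant pays for every node in front of the one being removed, which matches the slow deletions the commit message describes.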
    • F
      netfilter: xtables: add scheduling opportunity in get_counters · a5d7a714
      Florian Westphal committed
      There are reports about spurious softlockups during iptables-restore. A
      backtrace I saw points at get_counters -- it uses a sequence lock and also
      has an unbounded restart loop.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a5d7a714
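A userspace sketch of the pattern the commit above describes: a sequence-counter read that restarts until it sees a stable snapshot, with a scheduling opportunity between per-CPU passes. All names here are hypothetical, the counters use C11 atomics instead of the kernel's seqcount API, and `sched_yield()` is only a stand-in for a kernel scheduling point. Where exactly the real patch places that point is an assumption.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-CPU counter guarded by a sequence number. */
struct pcpu_counter {
	atomic_uint seq;		/* odd while a writer is mid-update */
	_Atomic uint64_t bytes;
	_Atomic uint64_t packets;
};

/* Writer: bump seq to odd, update, bump seq back to even. */
static void counter_write(struct pcpu_counter *c, uint64_t b, uint64_t p)
{
	atomic_fetch_add(&c->seq, 1);
	atomic_fetch_add(&c->bytes, b);
	atomic_fetch_add(&c->packets, p);
	atomic_fetch_add(&c->seq, 1);
}

/*
 * Reader: retry until the sequence number is even and unchanged
 * across the read. The loop has no upper bound on retries.
 */
static void counter_read(struct pcpu_counter *c, uint64_t *b, uint64_t *p)
{
	unsigned int start;

	do {
		start = atomic_load(&c->seq);
		*b = atomic_load(&c->bytes);
		*p = atomic_load(&c->packets);
	} while ((start & 1) || start != atomic_load(&c->seq));
}

/* Sum all per-CPU counters, yielding between CPUs. */
static void sum_counters(struct pcpu_counter *cpus, int ncpus,
			 uint64_t *bytes, uint64_t *packets)
{
	*bytes = 0;
	*packets = 0;
	for (int i = 0; i < ncpus; i++) {
		uint64_t b, p;

		counter_read(&cpus[i], &b, &p);
		*bytes += b;
		*packets += p;
		/* Stand-in for a scheduling point such as cond_resched(). */
		sched_yield();
	}
}
```

The yield does not bound the retry loop itself. It only keeps a long pass over many CPUs from monopolizing the processor without a break.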
    • F
      netfilter: nf_nat: don't bug when mapping already exists · 75c26314
      Florian Westphal committed
      It seems preferable to limp along if we have a conflicting mapping;
      it's certainly better than a BUG().
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      75c26314
    • S
      ipv6: fix memory leak with multiple tables during netns destruction · ba1cc08d
      Sabrina Dubroca committed
      fib6_net_exit only frees the main and local tables. If another table was
      created with fib6_alloc_table, we leak it when the netns is destroyed.
      
      Fix this in the same way ip_fib_net_exit cleans up tables, by walking
      through the whole hashtable of fib6_table's. We can get rid of the
      special cases for local and main, since they're also part of the
      hashtable.
      
      Reproducer:
          ip netns add x
          ip -net x -6 rule add from 6003:1::/64 table 100
          ip netns del x
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Fixes: 58f09b78 ("[NETNS][IPV6] ip6_fib - make it per network namespace")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba1cc08d
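The cleanup strategy in the commit above, walking the whole hashtable instead of freeing two known tables, can be sketched in userspace as follows. The struct layouts, `TABLE_HASHSZ` and all function names are hypothetical and do not reproduce the fib6 code.

```c
#include <stdlib.h>

#define TABLE_HASHSZ 4	/* hypothetical number of buckets */

struct table {
	struct table *next;
	unsigned int id;
};

/* Hypothetical per-namespace state: a hashtable of routing tables. */
struct netns_sketch {
	struct table *buckets[TABLE_HASHSZ];
};

/* Allocate a table and link it into the bucket chosen by its id. */
static struct table *table_alloc(struct netns_sketch *ns, unsigned int id)
{
	struct table *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->id = id;
	t->next = ns->buckets[id % TABLE_HASHSZ];
	ns->buckets[id % TABLE_HASHSZ] = t;
	return t;
}

/*
 * Free every table in every bucket. No table needs special handling,
 * because all of them live in the same hashtable. Returns the number
 * of tables freed.
 */
static unsigned int tables_free_all(struct netns_sketch *ns)
{
	unsigned int freed = 0;

	for (unsigned int i = 0; i < TABLE_HASHSZ; i++) {
		struct table *t = ns->buckets[i];

		while (t) {
			struct table *next = t->next;

			free(t);
			freed++;
			t = next;
		}
		ns->buckets[i] = NULL;
	}
	return freed;
}
```

A table added later, such as the table 100 created by the reproducer, is released by the same loop as the default ones, so nothing is left behind when the namespace goes away.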
  10. 08 Sep, 2017 4 commits