1. 10 11月, 2019 3 次提交
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 0cfaf03c
      Eric Dumazet 提交于
      [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]
      
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0cfaf03c
    • E
      inet: stop leaking jiffies on the wire · 07de7389
      Eric Dumazet 提交于
      [ Upstream commit a904a0693c189691eeee64f6c6b188bd7dc244e9 ]
      
      Historically linux tried to stick to RFC 791, 1122, 2003
      for IPv4 ID field generation.
      
      RFC 6864 made clear that no matter how hard we try,
      we can not ensure unicity of IP ID within maximum
      lifetime for all datagrams with a given source
      address/destination address/protocol tuple.
      
      Linux uses a per socket inet generator (inet_id), initialized
      at connection startup with a XOR of 'jiffies' and other
      fields that appear clear on the wire.
      
      Thiemo Nagel pointed that this strategy is a privacy
      concern as this provides 16 bits of entropy to fingerprint
      devices.
      
      Let's switch to a random starting point, this is just as
      good as far as RFC 6864 is concerned and does not leak
      anything critical.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NThiemo Nagel <tnagel@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07de7389
    • X
      erspan: fix the tun_info options_len check for erspan · 163901dc
      Xin Long 提交于
      [ Upstream commit 2eb8d6d2910cfe3dc67dc056f26f3dd9c63d47cd ]
      
      The check for !md doens't really work for ip_tunnel_info_opts(info) which
      only does info + 1. Also to avoid out-of-bounds access on info, it should
      ensure options_len is not less than erspan_metadata in both erspan_xmit()
      and ip6erspan_tunnel_xmit().
      
      Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      163901dc
  2. 29 10月, 2019 2 次提交
    • S
      ipv4: Return -ENETUNREACH if we can't create route but saddr is valid · 2fa80e64
      Stefano Brivio 提交于
      [ Upstream commit 595e0651d0296bad2491a4a29a7a43eae6328b02 ]
      
      ...instead of -EINVAL. An issue was found with older kernel versions
      while unplugging a NFS client with pending RPCs, and the wrong error
      code here prevented it from recovering once link is back up with a
      configured address.
      
      Incidentally, this is not an issue anymore since commit 4f8943f80883
      ("SUNRPC: Replace direct task wakeups from softirq context"), included
      in 5.2-rc7, had the effect of decoupling the forwarding of this error
      by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin
      Coddington.
      
      To the best of my knowledge, this isn't currently causing any further
      issue, but the error code doesn't look appropriate anyway, and we
      might hit this in other paths as well.
      
      In detail, as analysed by Gonzalo Siero, once the route is deleted
      because the interface is down, and can't be resolved and we return
      -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(),
      as the socket error seen by tcp_write_err(), called by
      tcp_retransmit_timer().
      
      In turn, tcp_write_err() indirectly calls xs_error_report(), which
      wakes up the RPC pending tasks with a status of -EINVAL. This is then
      seen by call_status() in the SUN RPC implementation, which aborts the
      RPC call calling rpc_exit(), instead of handling this as a
      potentially temporary condition, i.e. as a timeout.
      
      Return -EINVAL only if the input parameters passed to
      ip_route_output_key_hash_rcu() are actually invalid (this is the case
      if the specified source address is multicast, limited broadcast or
      all zeroes), but return -ENETUNREACH in all cases where, at the given
      moment, the given source address doesn't allow resolving the route.
      
      While at it, drop the initialisation of err to -ENETUNREACH, which
      was added to __ip_route_output_key() back then by commit
      0315e382 ("net: Fix behaviour of unreachable, blackhole and
      prohibit routes"), but actually had no effect, as it was, and is,
      overwritten by the fib_lookup() return code assignment, and anyway
      ignored in all other branches, including the if (fl4->saddr) one:
      I find this rather confusing, as it would look like -ENETUNREACH is
      the "default" error, while that statement has no effect.
      
      Also note that after commit fc75fc83 ("ipv4: dont create routes
      on down devices"), we would get -ENETUNREACH if the device is down,
      but -EINVAL if the source address is specified and we can't resolve
      the route, and this appears to be rather inconsistent.
      Reported-by: NStefan Walter <walteste@inf.ethz.ch>
      Analysed-by: NBenjamin Coddington <bcodding@redhat.com>
      Analysed-by: NGonzalo Siero <gsierohu@redhat.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2fa80e64
    • W
      ipv4: fix race condition between route lookup and invalidation · 2ec0df4e
      Wei Wang 提交于
      [ Upstream commit 5018c59607a511cdee743b629c76206d9c9e6d7b ]
      
      Jesse and Ido reported the following race condition:
      <CPU A, t0> - Received packet A is forwarded and cached dst entry is
      taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
      
      <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
      from multiple ISPs"), route is added / deleted and rt_cache_flush() is
      called
      
      <CPU B, t2> - Received packet B tries to use the same cached dst entry
      from t0, but rt_cache_valid() is no longer true and it is replaced in
      rt_cache_route() by the newer one. This calls dst_dev_put() on the
      original dst entry which assigns the blackhole netdev to 'dst->dev'
      
      <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
      to 'dst->dev' being the blackhole netdev
      
      There are 2 issues in the v4 routing code:
      1. A per-netns counter is used to do the validation of the route. That
      means whenever a route is changed in the netns, users of all routes in
      the netns needs to redo lookup. v6 has an implementation of only
      updating fn_sernum for routes that are affected.
      2. When rt_cache_valid() returns false, rt_cache_route() is called to
      throw away the current cache, and create a new one. This seems
      unnecessary because as long as this route does not change, the route
      cache does not need to be recreated.
      
      To fully solve the above 2 issues, it probably needs quite some code
      changes and requires careful testing, and does not suite for net branch.
      
      So this patch only tries to add the deleted cached rt into the uncached
      list, so user could still be able to use it to receive packets until
      it's done.
      
      Fixes: 95c47f9c ("ipv4: call dst_dev_put() properly")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Reported-by: NIdo Schimmel <idosch@idosch.org>
      Reported-by: NJesse Hathaway <jesse@mbuki-mvuki.org>
      Tested-by: NJesse Hathaway <jesse@mbuki-mvuki.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ec0df4e
  3. 08 10月, 2019 4 次提交
  4. 05 10月, 2019 1 次提交
  5. 01 10月, 2019 1 次提交
  6. 21 9月, 2019 1 次提交
  7. 19 9月, 2019 1 次提交
    • N
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · 67fe3b94
      Neal Cardwell 提交于
      [ Upstream commit af38d07ed391b21f7405fa1f936ca9686787d6d2 ]
      
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67fe3b94
  8. 10 9月, 2019 2 次提交
  9. 06 9月, 2019 1 次提交
    • H
      ipv4/icmp: fix rt dst dev null pointer dereference · 9febfd30
      Hangbin Liu 提交于
      [ Upstream commit e2c693934194fd3b4e795635934883354c06ebc9 ]
      
      In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
      e,g, with tunnel collect_md mode, which will cause kernel crash.
      Here is what the code path looks like, for GRE:
      
      - ip6gre_tunnel_xmit
        - ip6gre_xmit_ipv4
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmp_send
            - net = dev_net(rt->dst.dev); <-- here
      
      The reason is __metadata_dst_init() init dst->dev to NULL by default.
      We could not fix it in __metadata_dst_init() as there is no dev supplied.
      On the other hand, the reason we need rt->dst.dev is to get the net.
      So we can just try get it from skb->dev when rt->dst.dev is NULL.
      
      v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
      still use dst.dev and do a check to avoid crash.
      
      v3: No changes.
      
      v2: fix the issue in __icmp_send() instead of updating shared dst dev
      in {ip_md, ip6}_tunnel_xmit.
      
      Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9febfd30
  10. 16 8月, 2019 1 次提交
  11. 09 8月, 2019 1 次提交
  12. 04 8月, 2019 1 次提交
  13. 28 7月, 2019 5 次提交
    • C
      tcp: Reset bytes_acked and bytes_received when disconnecting · 1b200acd
      Christoph Paasch 提交于
      [ Upstream commit e858faf556d4e14c750ba1e8852783c6f9520a0e ]
      
      If an app is playing tricks to reuse a socket via tcp_disconnect(),
      bytes_acked/received needs to be reset to 0. Otherwise tcp_info will
      report the sum of the current and the old connection..
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Fixes: bdd1f9ed ("tcp: add tcpi_bytes_received to tcp_info")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1b200acd
    • E
      tcp: fix tcp_set_congestion_control() use from bpf hook · c60f57df
      Eric Dumazet 提交于
      [ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]
      
      Neal reported incorrect use of ns_capable() from bpf hook.
      
      bpf_setsockopt(...TCP_CONGESTION...)
        -> tcp_set_congestion_control()
         -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
          -> ns_capable_common()
           -> current_cred()
            -> rcu_dereference_protected(current->cred, 1)
      
      Accessing 'current' in bpf context makes no sense, since packets
      are processed from softirq context.
      
      As Neal stated : The capability check in tcp_set_congestion_control()
      was written assuming a system call context, and then was reused from
      a BPF call site.
      
      The fix is to add a new parameter to tcp_set_congestion_control(),
      so that the ns_capable() call is only performed under the right
      context.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Reported-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c60f57df
    • E
      tcp: be more careful in tcp_fragment() · 6323c238
      Eric Dumazet 提交于
      [ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]
      
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds the some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrew Prout <aprout@ll.mit.edu>
      Tested-by: NAndrew Prout <aprout@ll.mit.edu>
      Tested-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NChristoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6323c238
    • M
      ipv4: don't set IPv6 only flags to IPv4 addresses · 47ce4427
      Matteo Croce 提交于
      [ Upstream commit 2e60546368165c2449564d71f6005dda9205b5fb ]
      
      Avoid the situation where an IPV6 only flag is applied to an IPv4 address:
      
          # ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute
          # ip -4 addr show dev dummy0
          2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
              inet 192.0.2.1/24 scope global noprefixroute dummy0
                 valid_lft forever preferred_lft forever
      
      Or worse, by sending a malicious netlink command:
      
          # ip -4 addr show dev dummy0
          2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
              inet 192.0.2.1/24 scope global nodad optimistic dadfailed home tentative mngtmpaddr noprefixroute stable-privacy dummy0
                 valid_lft forever preferred_lft forever
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47ce4427
    • E
      igmp: fix memory leak in igmpv3_del_delrec() · aee5dd00
      Eric Dumazet 提交于
      [ Upstream commit e5b1c6c6277d5a283290a8c033c72544746f9b5b ]
      
      im->tomb and/or im->sources might not be NULL, but we
      currently overwrite their values blindly.
      
      Using swap() will make sure the following call to kfree_pmc(pmc)
      will properly free the psf structures.
      
      Tested with the C repro provided by syzbot, which basically does :
      
       socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
       setsockopt(3, SOL_IP, IP_ADD_MEMBERSHIP, "\340\0\0\2\177\0\0\1\0\0\0\0", 12) = 0
       ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=0}) = 0
       setsockopt(3, SOL_IP, IP_MSFILTER, "\340\0\0\2\177\0\0\1\1\0\0\0\1\0\0\0\377\377\377\377", 20) = 0
       ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP}) = 0
       exit_group(0)                    = ?
      
      BUG: memory leak
      unreferenced object 0xffff88811450f140 (size 64):
        comm "softirq", pid 0, jiffies 4294942448 (age 32.070s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000c7bad083>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<00000000c7bad083>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000c7bad083>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000c7bad083>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<000000009acc4151>] kmalloc include/linux/slab.h:547 [inline]
          [<000000009acc4151>] kzalloc include/linux/slab.h:742 [inline]
          [<000000009acc4151>] ip_mc_add1_src net/ipv4/igmp.c:1976 [inline]
          [<000000009acc4151>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2100
          [<000000004ac14566>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2484
          [<0000000052d8f995>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:959
          [<000000004ee1e21f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1248
          [<0000000066cdfe74>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2618
          [<000000009383a786>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3126
          [<00000000d8ac0c94>] __sys_setsockopt+0x98/0x120 net/socket.c:2072
          [<000000001b1e9666>] __do_sys_setsockopt net/socket.c:2083 [inline]
          [<000000001b1e9666>] __se_sys_setsockopt net/socket.c:2080 [inline]
          [<000000001b1e9666>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080
          [<00000000420d395e>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
          [<000000007fd83a4b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 24803f38 ("igmp: do not remove igmp souce list info when set link down")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Reported-by: syzbot+6ca1abd0db68b5173a4f@syzkaller.appspotmail.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aee5dd00
  14. 03 7月, 2019 3 次提交
    • M
      bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro · 79c6a8c0
      Martin KaFai Lau 提交于
      commit 257a525fe2e49584842c504a92c27097407f778f upstream.
      
      When the commit a6024562 ("udp: Add GRO functions to UDP socket")
      added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
      the reuseport_select_sock() assumption that skb->data is pointing
      to the transport header.
      
      This patch follows an earlier __udp6_lib_err() fix by
      passing a NULL skb to avoid calling the reuseport's bpf_prog.
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      79c6a8c0
    • D
      bpf: fix unconnected udp hooks · 613bc37f
      Daniel Borkmann 提交于
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      613bc37f
    • S
      ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop · 7c92f3ef
      Stephen Suryaputra 提交于
      [ Upstream commit 38c73529de13e1e10914de7030b659a2f8b01c3b ]
      
      In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local
      traffic"), the dif argument to __raw_v4_lookup() is coming from the
      returned value of inet_iif() but the change was done only for the first
      lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex.
      
      Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic")
      Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c92f3ef
  15. 22 6月, 2019 1 次提交
  16. 18 6月, 2019 4 次提交
  17. 11 6月, 2019 1 次提交
    • X
      ipv4: not do cache for local delivery if bc_forwarding is enabled · daa11cc8
      Xin Long 提交于
      [ Upstream commit 0a90478b93a46bdcd56ba33c37566a993e455d54 ]
      
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
      doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
      h2, and the packets can still be forwared.
      
      This issue was caused by the input route cache. It should only do
      the cache for either bc forwarding or local delivery. Otherwise,
      local delivery can use the route cache for bc forwarding of other
      interfaces.
      
      This patch is to fix it by not doing cache for local delivery if
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      daa11cc8
  18. 04 6月, 2019 3 次提交
    • E
      ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST · 46702dd5
      Eric Dumazet 提交于
      [ Upstream commit 903869bd10e6719b9df6718e785be7ec725df59f ]
      
      ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST
      
      Fixes: 3580d04aa674 ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      46702dd5
    • E
      ipv4/igmp: fix another memory leak in igmpv3_del_delrec() · e9f94e48
      Eric Dumazet 提交于
      [ Upstream commit 3580d04aa674383c42de7b635d28e52a1e5bc72c ]
      
      syzbot reported memory leaks [1] that I have back tracked to
      a missing cleanup from igmpv3_del_delrec() when
      (im->sfmode != MCAST_INCLUDE)
      
      Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely
      handle the cleanups before freeing.
      
      [1]
      
      BUG: memory leak
      unreferenced object 0xffff888123e32b00 (size 64):
        comm "softirq", pid 0, jiffies 4294942968 (age 8.010s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline]
          [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline]
          [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline]
          [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957
          [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
          [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
          [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
          [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 9c8bb163 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9f94e48
    • E
      inet: switch IP ID generator to siphash · 07480da0
      Eric Dumazet 提交于
      [ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ]
      
      According to Amit Klein and Benny Pinkas, IP ID generation is too weak
      and might be used by attackers.
      
      Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
      having 64bit key and Jenkins hash is risky.
      
      It is time to switch to siphash and its 128bit keys.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAmit Klein <aksecurity@gmail.com>
      Reported-by: NBenny Pinkas <benny@pinkas.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07480da0
  19. 26 5月, 2019 3 次提交
  20. 17 5月, 2019 1 次提交