1. 26 Feb, 2022 1 commit
  2. 18 Feb, 2022 1 commit
    • net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Eric Dumazet committed
      UDP sendmsg() can be lockless, which causes all kinds
      of data races.
      
      This patch converts sk->sk_tskey to remove one of these races.
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
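The race in the report above can be pictured with a small user-space model. This is only a sketch, not kernel code: `RacyKey`, `AtomicKey` and the helper functions are hypothetical names, and a `threading.Lock` stands in for the kernel's `atomic_t`. The racy variant keeps the read and the write as separate steps, so two senders that interleave take the same key.

```python
import threading


class RacyKey:
    """Plain counter: read and write are separate steps, like a bare u32."""

    def __init__(self, start=0):
        self.value = start

    def read(self):
        return self.value

    def write(self, new_value):
        self.value = new_value


class AtomicKey:
    """Counter whose fetch-and-increment is one indivisible step."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()  # stand-in for atomic_t (assumption)

    def fetch_inc(self):
        with self._lock:
            old = self._value
            self._value = old + 1
            return old


def racy_interleaving(start):
    """Replay one bad interleaving: both tasks read, then both write."""
    key = RacyKey(start)
    a = key.read()
    b = key.read()
    key.write(a + 1)
    key.write(b + 1)
    return a, b, key.value


def atomic_concurrent(start, senders, sends_each):
    """Let several threads draw keys; return the sorted keys they received."""
    key = AtomicKey(start)
    taken = []
    taken_lock = threading.Lock()

    def sender():
        for _ in range(sends_each):
            k = key.fetch_inc()
            with taken_lock:
                taken.append(k)

    threads = [threading.Thread(target=sender) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(taken)
```

In the racy replay both senders obtain the same key and one increment is lost; with the indivisible fetch-and-increment every sender gets a distinct key.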
  3. 30 Jan, 2022 1 commit
  4. 28 Jan, 2022 1 commit
    • ipv4: tcp: send zero IPID in SYNACK messages · 970a5a3e
      Eric Dumazet committed
      In commit 431280ee ("ipv4: tcp: send zero IPID for RST and
      ACK sent in SYN-RECV and TIME-WAIT state") we took care of some
      ctl packets sent by TCP.
      
      It turns out we need to use a similar strategy for SYNACK packets.
      
      By default, they carry IP_DF and IPID==0, but there are ways
      to ask them to use the hashed IP ident generator and thus
      be used to build off-path attacks.
      (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)
      
      One way is to run the following before the listener is started:
      echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc
      
      Another way is to use a forged ICMP ICMP_FRAG_NEEDED message
      with a very small MTU (like 68) to force a false return from
      ip_dont_fragment().
      
      In this patch, ip_build_and_send_pkt() uses the following
      heuristics.
      
      1) Most SYNACK packets are smaller than IPV4_MIN_MTU and therefore
      can use IP_DF regardless of the listener or route pmtu setting.
      
      2) In case the SYNACK packet is bigger than IPV4_MIN_MTU,
      we use prandom_u32() generator instead of the IPv4 hashed ident one.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Ray Che <xijiache@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Cc: Geoff Alexander <alexandg@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
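The two heuristics described in this commit can be sketched as a small decision function. This is an illustrative model under stated assumptions, not the kernel code: `choose_frag_off_and_id` and its `rand16` parameter are hypothetical names, `secrets.randbits` stands in for `prandom_u32()`, and whether the size comparison is strict or inclusive is an assumption here.

```python
import secrets

IPV4_MIN_MTU = 68  # smallest MTU every IPv4 link must support
IP_DF = 0x4000     # "don't fragment" bit in the frag_off field


def choose_frag_off_and_id(pkt_len, dont_fragment, rand16=None):
    """Return (frag_off, ip_id) for a TCP control packet such as a SYNACK.

    dont_fragment models the result of ip_dont_fragment() for the route.
    rand16 is an injectable 16-bit random source (hypothetical parameter).
    """
    if rand16 is None:
        rand16 = lambda: secrets.randbits(16)
    # Heuristic 1: a packet this small never needs fragmenting, so it can
    # carry DF and a zero ID regardless of listener or route settings.
    if dont_fragment or pkt_len <= IPV4_MIN_MTU:
        return IP_DF, 0
    # Heuristic 2: a larger packet gets a random ID rather than one from
    # the hashed ident generator.
    return 0, rand16()
```

The point of the design is that neither branch exposes the hashed IP ident counter that the off-path attack relies on observing.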
  5. 24 Jan, 2022 1 commit
    • ipv4: fix ip option filtering for locally generated fragments · 27a8caa5
      Jakub Kicinski committed
      During IP fragmentation we sanitize IP options. This means overwriting
      options which should not be copied with NOPs. Only the first fragment
      has the original, full options.
      
      ip_fraglist_prepare() copies the IP header and options from previous
      fragment to the next one. Commit 19c3401a ("net: ipv4: place control
      buffer handling away from fragmentation iterators") moved sanitizing
      options before ip_fraglist_prepare() which means options are sanitized
      and then overwritten again with the old values.
      
      Fixing this is not enough, however, nor did the sanitization work
      prior to aforementioned commit.
      
      ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
      for the length of the options. ipcb->opt of fragments is not populated
      (it's 0), only the head skb has the state properly built. So even when
      called at the right time ip_options_fragment() does nothing. This seems
      to date back all the way to v2.5.44 when the fast path for pre-fragmented
      skbs was introduced. Prior to that ip_options_build() would have been
      called for every fragment (in fact ever since v2.5.44 the fragmentation
      handling in ip_options_build() has been dead code, I'll clean it up in
      -next).
      
      In the original patch (see Link) caixf mentions fixing the handling
      for fragments other than the second one, but I'm not sure how _any_
      fragment could have had their options sanitized with the code
      as it stood.
      
      Tested with python (MTU on lo lowered to 1000 to force fragmentation):
      
        import socket
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
                     bytearray([7,4,5,192, 20|0x80,4,1,0]))
        s.sendto(b'1'*2000, ('127.0.0.1', 1234))
      
      Before:
      
      IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.36500 > localhost.search-agent: UDP, length 2000
      IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      
      After:
      
      IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
      IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      
      RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
      
      Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
      Fixes: 19c3401a ("net: ipv4: place control buffer handling away from fragmentation iterators")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: caixf <ooppublic@163.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
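The sanitization step this commit relies on can be modelled in user space. The sketch below is an approximation of what the message describes, not the kernel's ip_options_fragment(): `sanitize_fragment_options` is a hypothetical name, and it takes the raw option bytes directly instead of reading a length from ipcb->opt.optlen. Options whose type byte lacks the "copied" flag (0x80) are overwritten with NOPs.

```python
IPOPT_END = 0      # end of option list
IPOPT_NOOP = 1     # single-byte no-op option
IPOPT_COPY = 0x80  # "copied" flag: option must appear in every fragment


def sanitize_fragment_options(options):
    """Return the option bytes as they should look in a non-first fragment."""
    opts = bytearray(options)
    i = 0
    while i < len(opts):
        kind = opts[i]
        if kind == IPOPT_END:
            break
        if kind == IPOPT_NOOP:
            i += 1
            continue
        if i + 1 >= len(opts):
            break  # truncated option, stop rather than read past the end
        optlen = opts[i + 1]
        if optlen < 2 or i + optlen > len(opts):
            break  # malformed length, leave the rest untouched
        if not kind & IPOPT_COPY:
            # Not marked as copied: blank it out with NOPs.
            opts[i:i + optlen] = bytes([IPOPT_NOOP]) * optlen
        i += optlen
    return bytes(opts)
```

Fed the option bytes from the test script above, `bytearray([7,4,5,192, 20|0x80,4,1,0])`, the RR option (7) becomes four NOPs while RA (20 | 0x80) is kept, which matches the "After" capture.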
  6. 14 Nov, 2021 1 commit
  7. 30 Aug, 2021 1 commit
  8. 24 Aug, 2021 1 commit
  9. 03 Aug, 2021 1 commit
  10. 27 Jul, 2021 1 commit
  11. 25 Jun, 2021 1 commit
    • net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81
      Jakub Kicinski committed
      Dave observed a number of machines hitting OOM on the UDP send
      path. The workload seems to be sending large UDP packets over
      loopback. Since loopback has an MTU of 64k, the kernel will try to
      allocate an skb with up to 64k of head space. This has a good
      chance of failing under memory pressure. What's worse, if
      the message length is <32k the allocation may trigger the
      OOM killer.
      
      This is entirely avoidable: we can use an skb with page frags.
      
      af_unix solves a similar problem by limiting the head
      length to SKB_MAX_ALLOC. This seems like a good and simple
      approach. It means that UDP messages > 16kB will now
      use fragments if the underlying device supports SG. If extra
      allocator pressure causes regressions in real workloads,
      we can switch to trying the large allocation first and
      falling back.
      
      v4: pre-calculate all the additions to alloclen so
          we can be sure it won't go over order-2
      Reported-by: Dave Jones <dsj@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
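The allocation policy described in this commit boils down to one decision, sketched below. This is a simplified model and not the kernel logic: `plan_allocation` is a hypothetical name, the 16 kB value for SKB_MAX_ALLOC is taken from the message's "> 16kB" wording, and header overhead (the "additions to alloclen" mentioned for v4) is ignored.

```python
SKB_MAX_ALLOC = 16 * 1024  # head-length cap, per the "> 16kB" wording above


def plan_allocation(msg_len, device_supports_sg, max_head=SKB_MAX_ALLOC):
    """Return "paged" when page frags should be used, otherwise "linear".

    Page frags are only an option when the device supports scatter-gather
    (SG); without it the whole message still has to sit in the skb head.
    """
    if msg_len > max_head and device_supports_sg:
        return "paged"
    return "linear"
```

For example, a 60000-byte send over an SG-capable device is planned as "paged", so no single large contiguous allocation is needed, while a 1000-byte send stays "linear".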
  12. 29 Mar, 2021 1 commit
  13. 04 Feb, 2021 1 commit
  14. 08 Jan, 2021 4 commits
  15. 24 Nov, 2020 1 commit
  16. 11 Sep, 2020 1 commit
  17. 09 Sep, 2020 1 commit
  18. 01 Sep, 2020 2 commits
  19. 25 Aug, 2020 1 commit
  20. 21 Aug, 2020 2 commits
  21. 02 Jul, 2020 1 commit
    • ip: Fix SO_MARK in RST, ACK and ICMP packets · 0da7536f
      Willem de Bruijn committed
      When no full socket is available, skbs are sent over a per-netns
      control socket. Its sk_mark is temporarily adjusted to match that
      of the real (request or timewait) socket or to reflect an incoming
      skb, so that the outgoing skb inherits this in __ip_make_skb.
      
      Introduction of the socket cookie mark field broke this. Now the
      skb mark is set through the cookie and cork:
      
      <caller>		# init sockc.mark from sk_mark or cmsg
      ip_append_data
        ip_setup_cork		# convert sockc.mark to cork mark
      ip_push_pending_frames
        ip_finish_skb
          __ip_make_skb	# set skb->mark to cork mark
      
      But I missed these special control sockets. Update all callers of
      __ip(6)_make_skb that were originally missed.
      
      For IPv6, the same two icmp(v6) paths are affected. The third
      case is not, as commit 92e55f41 ("tcp: don't annotate
      mark on control socket from tcp_v6_send_response()") replaced
      the ctl_sk->sk_mark with passing the mark field directly as a
      function argument. That commit predates the commit that
      introduced the bug.
      
      Fixes: c6af0c22 ("ip: support SO_MARK cmsg")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 21 Jun, 2020 1 commit
  23. 30 Mar, 2020 1 commit
  24. 13 Mar, 2020 1 commit
  25. 15 Jan, 2020 1 commit
  26. 08 Dec, 2019 1 commit
    • inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet committed
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pair this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
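The guard added by this commit can be pictured as a check made once, when the cork is set up, so that every later append path is covered. The following is a user-space sketch under assumptions: `setup_cork` is a hypothetical stand-in for ip_setup_cork(), the threshold is assumed to be the IPv4 minimum MTU of 68, and the choice of ENETUNREACH as the error is an assumption, not something the message states.

```python
import errno

IPV4_MIN_MTU = 68  # assumed threshold: the smallest legal IPv4 MTU


def inetdev_valid_mtu(mtu):
    """Model of the validity check: reject MTUs below the IPv4 minimum."""
    return mtu >= IPV4_MIN_MTU


def setup_cork(device_mtu):
    """Read the device MTU once and refuse to cork if it is too small."""
    fragsize = device_mtu  # single read, in the spirit of READ_ONCE()
    if not inetdev_valid_mtu(fragsize):
        # Error code is an assumption made for this sketch.
        raise OSError(errno.ENETUNREACH, "MTU too small for IPv4")
    return {"fragsize": fragsize}
```

Reading the MTU into a local variable before validating it mirrors the reason for the READ_ONCE() annotation: the value that is checked is the same value that is used afterwards.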
  27. 16 Nov, 2019 1 commit
  28. 22 Oct, 2019 1 commit
  29. 19 Oct, 2019 1 commit
  30. 27 Sep, 2019 1 commit
    • tcp: honor SO_PRIORITY in TIME_WAIT state · f6c0f5d2
      Eric Dumazet committed
      ctl packets sent on behalf of TIME_WAIT sockets currently
      have a zero skb->priority, which can cause various problems.
      
      In this patch we:
      
      - add a tw_priority field in struct inet_timewait_sock.
      
      - populate it from sk->sk_priority when a TIME_WAIT is created.
      
      - For IPv4, change ip_send_unicast_reply() and its two
        callers to propagate tw_priority correctly.
        ip_send_unicast_reply() no longer changes sk->sk_priority.
      
      - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
        field to tcp_v6_send_response() and tcp_v6_send_ack().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
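The propagation described in the bullets above can be modelled with two small records. This is a toy illustration only: `Sock`, `TimewaitSock`, `enter_time_wait` and `reply_priority` are hypothetical names that mirror the fields named in the message, not kernel structures or functions.

```python
from dataclasses import dataclass


@dataclass
class Sock:
    sk_priority: int


@dataclass
class TimewaitSock:
    tw_priority: int


def enter_time_wait(sk):
    """Copy the priority when the TIME_WAIT socket is created."""
    return TimewaitSock(tw_priority=sk.sk_priority)


def reply_priority(sock):
    """Priority to stamp on a control packet sent on behalf of sock."""
    if isinstance(sock, TimewaitSock):
        return sock.tw_priority
    return sock.sk_priority
```

The reply path receives the priority as a value and leaves the original socket's sk_priority untouched, which is the behaviour the IPv4 bullet asks for.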
  31. 21 Sep, 2019 1 commit
  32. 14 Sep, 2019 1 commit
    • ip: support SO_MARK cmsg · c6af0c22
      Willem de Bruijn committed
      Enable setting skb->mark for UDP and RAW sockets using cmsg.
      
      This is analogous to existing support for TOS, TTL, txtime, etc.
      
      Packet sockets already support this as of commit c7d39e32
      ("packet: support per-packet fwmark for af_packet sendmsg").
      
      Similar to other fields, this is implemented by:
      1. initializing sockcm_cookie.mark from the socket option sk_mark
      2. optionally overwriting this in ip_cmsg_send/ip6_datagram_send_ctl
      3. initializing inet_cork.mark from sockcm_cookie.mark
      4. initializing each (usually just one) skb->mark from inet_cork.mark
      
      Step 1 is handled in one location for most protocols by ipcm_init_sk
      as of commit 35178206 ("ipv4: ipcm_cookie initializers").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
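The four steps listed in this commit form a simple hand-off chain, sketched below as a user-space model. All class and function names are hypothetical and only echo the structures named in the message (sockcm_cookie, inet_cork, skb->mark); none of this is kernel API.

```python
from dataclasses import dataclass


@dataclass
class SockcmCookie:
    mark: int


@dataclass
class InetCork:
    mark: int


def init_cookie(sk_mark):
    """Step 1: start from the socket-wide SO_MARK value."""
    return SockcmCookie(mark=sk_mark)


def apply_cmsg(cookie, cmsg_mark):
    """Step 2: a per-call cmsg, if present, overrides the socket value."""
    if cmsg_mark is not None:
        cookie.mark = cmsg_mark
    return cookie


def cork_from_cookie(cookie):
    """Step 3: the cork remembers the mark for the pending datagram."""
    return InetCork(mark=cookie.mark)


def skb_marks(cork, n_skbs=1):
    """Step 4: every skb built from the cork inherits its mark."""
    return [cork.mark] * n_skbs


def send(sk_mark, cmsg_mark=None, n_skbs=1):
    """Run the whole chain and return the mark of each resulting skb."""
    cookie = apply_cmsg(init_cookie(sk_mark), cmsg_mark)
    return skb_marks(cork_from_cookie(cookie), n_skbs)
```

With a socket mark of 7, `send(7)` yields a mark of 7, while `send(7, cmsg_mark=42)` yields 42 for that one call and leaves the socket-wide value unchanged.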
  33. 27 Jun, 2019 1 commit
  34. 15 Jun, 2019 1 commit
  35. 12 Jun, 2019 1 commit