1. 16 Apr, 2020 1 commit
  2. 15 Apr, 2020 2 commits
    • sock: fix potential memory leak in proto_register() · 510cdc8f
      zhanglin committed
      mainline inclusion
      from mainline-v5.3-rc7
      commit b45ce32135d1c82a5bf12aa56957c3fd27956057
      category: bugfix
      bugzilla: 20825
      CVE: NA
      
      -------------------------------------
      
      If the number of registered protocols exceeds PROTO_INUSE_NR, prot is
      still added to proto_list, but no bit is left for it in
      proto_inuse_idx.
      
      Changes since v2:
      * Propagate the error code properly
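The failure mode described above can be sketched with a toy model. This is only an illustration, not the kernel code: the class, the method names, the bitmap size of 64 and the choice of ENOSPC as the error code are all assumptions made for the example.

```python
import errno

PROTO_INUSE_NR = 64  # assumed bitmap size, illustrative only


class ProtoRegistry:
    """Toy model of proto_list plus the proto_inuse_idx bitmap."""

    def __init__(self):
        self.inuse_idx = [False] * PROTO_INUSE_NR
        self.proto_list = []

    def _assign_idx(self):
        # Return a free index, or a negative errno when the bitmap is full.
        for i, used in enumerate(self.inuse_idx):
            if not used:
                self.inuse_idx[i] = True
                return i
        return -errno.ENOSPC  # hypothetical choice of error code

    def register(self, name):
        idx = self._assign_idx()
        if idx < 0:
            # Propagate the error instead of listing a protocol without an index.
            return idx
        self.proto_list.append((name, idx))
        return 0


registry = ProtoRegistry()
results = [registry.register(f"proto{i}") for i in range(PROTO_INUSE_NR + 1)]
```

In this model the registration that does not fit in the bitmap is rejected and never reaches the list, which is the behaviour the commit message asks for.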
      Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: guodeqing <geffrey.guo@huawei.com>
      Reviewed-by: Wenan Mao <maowenan@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: silence KCSAN warnings around sk_add_backlog() calls · b25fcaeb
      Eric Dumazet committed
      mainline inclusion
      from mainline-5.4-rc4
      commit 8265792bf887
      category: bugfix
      bugzilla: 24071
      CVE: NA
      
      -------------------------------------------------
      
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Huang Guobin <huangguobin4@huawei.com>
      Reviewed-by: Wenan Mao <maowenan@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
  3. 27 Dec, 2019 10 commits
    • net: use skb_queue_empty_lockless() in busy poll contexts · 027c3d07
      Eric Dumazet committed
      [ Upstream commit 3f926af3f4d688e2e11e7f8ed04e277a14d4d4a4 ]
      
      Busy polling usually runs without locks.
      Let's use skb_queue_empty_lockless() instead of skb_queue_empty().
      
      Also use READ_ONCE() in __skb_try_recv_datagram() to address
      a similar potential problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: annotate accesses to sk->sk_incoming_cpu · 46b32fd2
      Eric Dumazet committed
      [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]
      
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        net/ipv4/udp.c
        net/ipv4/inet_hashtables.c
        net/ipv6/inet6_hashtables.c
        net/ipv6/udp.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: Unpublish sk from sk_reuseport_cb before call_rcu · b5b4326a
      Martin KaFai Lau committed
      [ Upstream commit 8c7138b33e5c690c308b2a7085f6313fdcb3f616 ]
      
      The "reuse->sock[]" array is shared by multiple sockets.  The going away
      sk must unpublish itself from "reuse->sock[]" before making call_rcu()
      call.  However, this unpublish-action is currently done after a grace
      period and it may cause use-after-free.
      
      The fix is to move reuseport_detach_sock() to sk_destruct().
      Due to the above reason, any socket with sk_reuseport_cb has
      to go through the rcu grace period before freeing it.
      
      It is a rather old bug (~3 yrs).  The Fixes tag is not necessarily
      the right commit, but it is the one that introduced the SOCK_RCU_FREE
      logic, and this fix depends on it.
      
      Fixes: a4298e45 ("net: add SOCK_RCU_FREE socket flag")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: remove duplicate fetch in sock_getsockopt · be385514
      JingYi Hou committed
      [ Upstream commit d0bae4a0e3d8c5690a885204d7eb2341a5b4884d ]
      
      In sock_getsockopt(), 'optlen' is fetched the first time from userspace.
      'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is
      fetched the second time from userspace.
      
      If the value is changed between the two fetches, it may cause security
      problems or unexpected behavior, and there is no reason to fetch it a
      second time.
      
      To fix this, remove the second fetch.
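The double-fetch problem described above can be illustrated with a small user-space model. It is a sketch, not the kernel function: `UserMemory`, `get_user()` and both handler functions are hypothetical names, and the "concurrent" change is simulated by returning a different value on the second read.

```python
class UserMemory:
    """Toy stand-in for a user-space int that another thread may rewrite."""

    def __init__(self, values):
        self._values = list(values)  # value returned by each successive fetch
        self.fetches = 0

    def get_user(self):
        value = self._values[min(self.fetches, len(self._values) - 1)]
        self.fetches += 1
        return value


def getsockopt_double_fetch(optlen):
    length = optlen.get_user()
    if length < 0:
        return "EINVAL"
    # Second fetch: the value may no longer be the one that was validated.
    length = optlen.get_user()
    return length


def getsockopt_single_fetch(optlen):
    length = optlen.get_user()
    if length < 0:
        return "EINVAL"
    # Reuse the validated copy; later user-space changes are not observed.
    return length


racy = getsockopt_double_fetch(UserMemory([16, -1]))
safe = getsockopt_single_fetch(UserMemory([16, -1]))
```

With the second fetch in place the handler ends up working with a negative length that was never checked; with a single fetch it keeps the validated value.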
      Signed-off-by: JingYi Hou <houjingyi647@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values · 550542b2
      Guillaume Nault committed
      mainline inclusion
      from mainline-5.0
      commit 4057765f2de
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
      SO_SNDBUF and SO_RCVBUF (and their *BUFFORCE version) may overflow or
      underflow their input value. This patch aims at providing explicit
      handling of these extreme cases, to get a clear behaviour even with
      values bigger than INT_MAX / 2 or lower than INT_MIN / 2.
      
      For simplicity, only SO_SNDBUF and SO_SNDBUFFORCE are described here,
      but the same explanation and fix apply to SO_RCVBUF and SO_RCVBUFFORCE
      (with 'SNDBUF' replaced by 'RCVBUF' and 'wmem_max' by 'rmem_max').
      
      Overflow of positive values
      ===========================
      
      When handling SO_SNDBUF or SO_SNDBUFFORCE, if 'val' exceeds
      INT_MAX / 2, the buffer size is set to its minimum value because
      'val * 2' overflows, and max_t() considers that it's smaller than
      SOCK_MIN_SNDBUF. For SO_SNDBUF, this can only happen with
      net.core.wmem_max > INT_MAX / 2.
      
      SO_SNDBUF and SO_SNDBUFFORCE are actually designed to let users probe
      for the maximum buffer size by setting an arbitrary large number that
      gets capped to the maximum allowed/possible size. Having the upper
      half of the positive integer space to potentially reduce the buffer
      size to its minimum value defeats this purpose.
      
      This patch caps the base value to INT_MAX / 2, so that bigger values
      don't overflow and keep setting the buffer size to its maximum.
      
      Underflow of negative values
      ============================
      
      For negative numbers, SO_SNDBUF always considers them bigger than
      net.core.wmem_max, which is bounded by [SOCK_MIN_SNDBUF, INT_MAX].
      Therefore such values are set to net.core.wmem_max and we're back to
      the behaviour of positive integers described above (return maximum
      buffer size if wmem_max <= INT_MAX / 2, return SOCK_MIN_SNDBUF
      otherwise).
      
      However, SO_SNDBUFFORCE behaves differently. The user value is
      directly multiplied by two and compared with SOCK_MIN_SNDBUF. If
      'val * 2' doesn't underflow or if it underflows to a value smaller
      than SOCK_MIN_SNDBUF then buffer size is set to its minimum value.
      Otherwise the buffer size is set to the underflowed value.
      
      This patch treats negative values passed to SO_SNDBUFFORCE as null, to
      prevent underflows. Therefore negative values now always set the buffer
      size to its minimum value.
      
      Even though SO_SNDBUF behaves inconsistently by setting buffer size to
      the maximum value when passed a negative number, no attempt is made to
      modify this behaviour. There may exist some programs that rely on using
      negative numbers to set the maximum buffer size. Avoiding overflows
      because of extreme net.core.wmem_max values is the most we can do here.
      
      Summary of altered behaviours
      =============================
      
      val      : user-space value passed to setsockopt()
      val_uf   : the underflowed value resulting from doubling val when
                 val < INT_MIN / 2
      wmem_max : short for net.core.wmem_max
      val_cap  : min(val, wmem_max)
      min_len  : minimal buffer length (that is, SOCK_MIN_SNDBUF)
      max_len  : maximal possible buffer length, regardless of wmem_max (that
                 is, INT_MAX - 1)
      ^^^^     : altered behaviour
      
      SO_SNDBUF:
      +-------------------------+-------------+------------+----------------+
      |       CONDITION         | OLD RESULT  | NEW RESULT |    COMMENT     |
      +-------------------------+-------------+------------+----------------+
      | val < 0 &&              |             |            | No overflow,   |
      | wmem_max <= INT_MAX/2   | wmem_max*2  | wmem_max*2 | keep original  |
      |                         |             |            | behaviour      |
      +-------------------------+-------------+------------+----------------+
      | val < 0 &&              |             |            | Cap wmem_max   |
      | INT_MAX/2 < wmem_max    | min_len     | max_len    | to prevent     |
      |                         |             | ^^^^^^^    | overflow       |
      +-------------------------+-------------+------------+----------------+
      | 0 <= val <= min_len/2   | min_len     | min_len    | Ordinary case  |
      +-------------------------+-------------+------------+----------------+
      | min_len/2 < val &&      | val_cap*2   | val_cap*2  | Ordinary case  |
      | val_cap <= INT_MAX/2    |             |            |                |
      +-------------------------+-------------+------------+----------------+
      | min_len < val &&        |             |            | Cap val_cap    |
      | INT_MAX/2 < val_cap     | min_len     | max_len    | again to       |
      | (implies that           |             | ^^^^^^^    | prevent        |
      | INT_MAX/2 < wmem_max)   |             |            | overflow       |
      +-------------------------+-------------+------------+----------------+
      
      SO_SNDBUFFORCE:
      +------------------------------+---------+---------+------------------+
      |          CONDITION           | BEFORE  | AFTER   |     COMMENT      |
      |                              | PATCH   | PATCH   |                  |
      +------------------------------+---------+---------+------------------+
      | val < INT_MIN/2 &&           | min_len | min_len | Underflow with   |
      | val_uf <= min_len            |         |         | no consequence   |
      +------------------------------+---------+---------+------------------+
      | val < INT_MIN/2 &&           | val_uf  | min_len | Set val to 0 to  |
      | val_uf > min_len             |         | ^^^^^^^ | avoid underflow  |
      +------------------------------+---------+---------+------------------+
      | INT_MIN/2 <= val < 0         | min_len | min_len | No underflow     |
      +------------------------------+---------+---------+------------------+
      | 0 <= val <= min_len/2        | min_len | min_len | Ordinary case    |
      +------------------------------+---------+---------+------------------+
      | min_len/2 < val <= INT_MAX/2 | val*2   | val*2   | Ordinary case    |
      +------------------------------+---------+---------+------------------+
      | INT_MAX/2 < val              | min_len | max_len | Cap val to       |
      |                              |         | ^^^^^^^ | prevent overflow |
      +------------------------------+---------+---------+------------------+
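The SO_SNDBUFFORCE rows of the table can be reproduced with a short model of the arithmetic. This is a sketch under stated assumptions: 32-bit two's-complement wraparound is emulated by `wrap_int32()`, the function names are hypothetical, and `SOCK_MIN_SNDBUF = 4608` is a placeholder because the real minimum depends on the build.

```python
INT_MAX = 2**31 - 1
INT_MIN = -(2**31)
SOCK_MIN_SNDBUF = 4608  # hypothetical min_len; the real value is build-dependent


def wrap_int32(x):
    """Reduce x to a signed 32-bit value, assuming two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31


def sndbufforce_old(val):
    # val * 2 may wrap before it is compared with the minimum.
    return max(wrap_int32(val * 2), SOCK_MIN_SNDBUF)


def sndbufforce_new(val):
    val = max(val, 0)             # negative values are treated as zero
    val = min(val, INT_MAX // 2)  # cap so that val * 2 still fits in an int
    return max(val * 2, SOCK_MIN_SNDBUF)


for val in (INT_MIN // 2 - 1, -1, 0, 100000, INT_MAX // 2 + 1):
    print(f"{val:>12}: old={sndbufforce_old(val):>10} new={sndbufforce_new(val):>10}")
```

Only the two extreme inputs differ between the old and new functions, matching the rows marked `^^^^^^^` in the table.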
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
      Reviewed-by: Wenan Mao <maowenan@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: ensure unbound datagram socket to be chosen when not in a VRF · 16d706b7
      Mike Manning committed
      mainline inclusion
      from mainline-v5.0
      commit 6da5b0f027a8
      category: bugfix
      bugzilla: 9289
      CVE: NA
      
      -------------------------------------------------
      
      Ensure an unbound datagram skt is chosen when not in a VRF. The check
      for a device match in compute_score() for UDP must be performed when
      there is no device match. For this, a failure is returned when there is
      no device match. This ensures that bound sockets are never selected,
      even if there is no unbound socket.
      
      Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
      packets are currently blocked, as flowi6_oif was set to that of the
      master vrf device, and the ipi6_ifindex is that of the slave device.
      Allow these packets to be sent by checking the device with ipi6_ifindex
      has the same L3 scope as that of the bound device of the skt, which is
      the master vrf device. Note that this check always succeeds if the skt
      is unbound.
      
      Even though the right datagram skt is now selected by compute_score(),
      a different skt is being returned that is bound to the wrong vrf. The
      difference between these and stream sockets is the handling of the skt
      option for SO_REUSEPORT. While the handling when adding a skt for reuse
      correctly checks that the bound device of the skt is a match, the skts
      in the hashslot are already incorrect. So for the same hash, a skt for
      the wrong vrf may be selected for the required port. The root cause is
      that the skt is immediately placed into a slot when it is created,
      but when the skt is then bound using SO_BINDTODEVICE, it remains in the
      same slot. The solution is to move the skt to the correct slot by
      forcing a rehash.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
      Reviewed-by: Keefe LIU <liuqifa@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: fix possible overflow in __sk_mem_raise_allocated() · 362cf792
      Eric Dumazet committed
      mainline inclusion
      from mainline-v5.0
      commit 5bf325a53202
      category: bugfix
      bugzilla: 9551
      CVE: NA
      
      -------------------------------------------------
      
      With many active TCP sockets, fat TCP sockets could fool
      __sk_mem_raise_allocated() thanks to an overflow.
      
      They would increase their share of the memory, instead
      of decreasing it.
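A rough illustration of how such an overflow can invert a fairness check follows. The commit message does not show the code, so the shape of the comparison, the function names and all the numbers below are assumptions chosen only to make the wraparound visible.

```python
def wrap_int32(x):
    """Reduce x to a signed 32-bit value, assuming two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31


def within_share_32bit(limit_pages, sockets, pages_per_socket):
    # Product computed in a 32-bit int: it can wrap to a negative value.
    return limit_pages > wrap_int32(sockets * pages_per_socket)


def within_share_64bit(limit_pages, sockets, pages_per_socket):
    # Product computed in a wider type: no wraparound at these magnitudes.
    return limit_pages > sockets * pages_per_socket


limit = 1_000_000        # hypothetical memory limit, in pages
sockets = 100_000        # many active sockets
fat_socket_pages = 30_000  # pages held by one large socket

fooled = within_share_32bit(limit, sockets, fat_socket_pages)
correct = within_share_64bit(limit, sockets, fat_socket_pages)
```

Here the 32-bit product wraps to a negative number, so the large socket appears to be well within its share and would be allowed to grow further; the wider computation rejects it.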
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Shangli <shangli1@huawei.com>
      Reviewed-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • sock_diag: fix autoloading of the raw_diag module · ccc7cd64
      Andrei Vagin committed
      mainline inclusion
      from mainline-4.20
      commit c34c1287778b
      category: bugfix
      bugzilla: 6053
      CVE: NA
      
      -------------------------------------------------
      
      IPPROTO_RAW isn't registered as an inet protocol, so
      inet_protos[protocol] is always NULL for it.
      
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Fixes: bf2ae2e4 ("sock_diag: request _diag module only when the family or proto has been registered")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: call sk_dst_reset when set SO_DONTROUTE · 37b09172
      yupeng committed
      [ Upstream commit 0fbe82e628c817e292ff588cd5847fc935e025f2 ]
      
      After SO_DONTROUTE is set to 1, the IP layer should not route packets if
      the destination IP address is not in link scope. But if the socket has
      cached the dst_entry, such packets are still routed until the
      sk_dst_cache expires. So we should clear the sk_dst_cache when a user
      sets the SO_DONTROUTE option. Below are server/client Python scripts
      which can reproduce this issue:
      
      server side code:
      
      ==========================================================================
      import socket
      import struct
      import time
      
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.bind(('0.0.0.0', 9000))
      s.listen(1)
      sock, addr = s.accept()
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
      while True:
          sock.send(b'foo')
          time.sleep(1)
      ==========================================================================
      
      client side code:
      ==========================================================================
      import socket
      import time
      
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect(('server_address', 9000))
      while True:
          data = s.recv(1024)
          print(data)
      ==========================================================================
      Signed-off-by: yupeng <yupeng0921@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • sock: Make sock->sk_stamp thread-safe · a2a9331c
      Deepa Dinamani committed
      [ Upstream commit 3a0ed3e9619738067214871e9cb826fa23b2ddb9 ]
      
      Al Viro mentioned (Message-ID
      <20170626041334.GZ10672@ZenIV.linux.org.uk>)
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
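
The seqlock idea mentioned above can be sketched in a few lines. This is a single-threaded model for illustration only: the class and method names are hypothetical, memory barriers are omitted, and the "concurrent" writer is simulated by interleaving the steps by hand.

```python
class SeqLock:
    """Single-writer sequence counter; readers retry instead of blocking."""

    def __init__(self):
        self.sequence = 0

    def write_begin(self):
        self.sequence += 1  # odd: a write is in progress

    def write_end(self):
        self.sequence += 1  # even: the data is stable again

    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        # Retry if a write was in progress or has started since read_begin().
        return (start & 1) == 1 or start != self.sequence


class Stamp:
    """64-bit timestamp stored as two 32-bit halves, as on a 32-bit machine."""

    def __init__(self):
        self.lock = SeqLock()
        self.hi = 0
        self.lo = 0

    def write(self, value):
        self.lock.write_begin()
        self.hi = value >> 32
        self.lo = value & 0xFFFFFFFF
        self.lock.write_end()

    def read(self):
        while True:
            start = self.lock.read_begin()
            value = (self.hi << 32) | self.lo
            if not self.lock.read_retry(start):
                return value


stamp = Stamp()
stamp.write(0x0000000200000001)

# Simulate a reader that samples while a write is half done.
start = stamp.lock.read_begin()
stamp.lock.write_begin()
stamp.hi = 3  # only the high half has been updated so far
torn = (stamp.hi << 32) | stamp.lo
torn_detected = stamp.lock.read_retry(start)
stamp.lo = 7
stamp.lock.write_end()
```

The reader never takes a lock; it only notices that the sequence changed and would repeat the read, which is why a half-updated 64-bit value is not returned.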
      
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      locality")
      
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  4. 01 Dec, 2018 1 commit
  5. 07 Aug, 2018 1 commit
  6. 03 Aug, 2018 1 commit
  7. 24 Jul, 2018 1 commit
  8. 04 Jul, 2018 2 commits
    • net/sched: Make etf report drops on error_queue · 4b15c707
      Jesus Sanchez-Palencia committed
      Use the socket error queue for reporting dropped packets if the
      socket has enabled that feature through the SO_TXTIME API.
      
      Packets are dropped either on enqueue() if they aren't accepted by the
      qdisc or on dequeue() if the system misses their deadline. Those are
      reported as different errors so applications can react accordingly.
      
      Userspace can retrieve the errors through the socket error queue and the
      corresponding cmsg interfaces. A struct sock_extended_err* is used for
      returning the error data, and the packet's timestamp can be retrieved by
      adding both ee_data and ee_info fields as e.g.:
      
          ((__u64) serr->ee_data << 32) + serr->ee_info
      
      This feature is disabled by default and must be explicitly enabled by
      applications. Enabling it can bring some overhead for the Tx cycles
      of the application.
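The timestamp reconstruction described above can be sketched from user space as follows. The field layout is that of struct sock_extended_err in <linux/errqueue.h>; the helper name and the sample payload are hypothetical, and receiving the control message from the error queue is left out.

```python
import struct

# struct sock_extended_err:
#   __u32 ee_errno; __u8 ee_origin, ee_type, ee_code, ee_pad;
#   __u32 ee_info; __u32 ee_data;
SOCK_EXTENDED_ERR = struct.Struct("=IBBBBII")


def dropped_packet_txtime(cmsg_data):
    """Return the 64-bit transmit time carried in a sock_extended_err payload."""
    fields = SOCK_EXTENDED_ERR.unpack_from(cmsg_data)
    ee_info, ee_data = fields[5], fields[6]
    return (ee_data << 32) + ee_info


# Hypothetical payload: the errno, origin and code values are placeholders.
txtime_ns = 1_530_000_000_123_456_789
payload = SOCK_EXTENDED_ERR.pack(
    0, 0, 0, 0, 0, txtime_ns & 0xFFFFFFFF, txtime_ns >> 32
)
recovered = dropped_packet_txtime(payload)
```

The high 32 bits travel in ee_data and the low 32 bits in ee_info, so the two fields are recombined exactly as in the expression from the commit message.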
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add a new socket option for a future transmit time. · 80b14dee
      Richard Cochran committed
      This patch introduces SO_TXTIME. User space enables this option in
      order to pass a desired future transmit time in a CMSG when calling
      sendmsg(2). The argument to this socket option is an 8-byte struct
      provided by the uapi header net_tstamp.h, defined as:
      
      struct sock_txtime {
      	clockid_t 	clockid;
      	u32		flags;
      };
      
      Note that new fields were added to struct sock by filling a 2-byte
      hole found in the struct. For that reason, neither the struct size nor
      the number of cachelines was altered.
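As a sketch of what user space would pass as the option value, the struct above can be packed like this. The helper name is hypothetical, and the choice of CLOCK_TAI (clock id 11 on Linux) is only an example; the commit message does not say which clocks are accepted.

```python
import struct

CLOCK_TAI = 11  # Linux clock id, used here purely as an example


def pack_sock_txtime(clockid, flags=0):
    """Pack struct sock_txtime { clockid_t clockid; u32 flags; }."""
    # "=iI": a 32-bit signed clockid followed by a 32-bit unsigned flags word.
    return struct.pack("=iI", clockid, flags)


optval = pack_sock_txtime(CLOCK_TAI)
```

The resulting 8 bytes are what would be handed to setsockopt() at the SOL_SOCKET level; the numeric value of SO_TXTIME has to come from the kernel headers, since Python's socket module may not define it.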
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 02 Jul, 2018 2 commits
  10. 29 Jun, 2018 1 commit
  11. 13 Jun, 2018 1 commit
    • Revert "net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets" · cdb8744d
      Bart Van Assche committed
      Revert the patch mentioned in the subject because it breaks at least
      the Avahi mDNS daemon. Specifically, that patch causes the Ubuntu 18.04
      Avahi daemon to fail to start:
      
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully called chroot().
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully dropped remaining capabilities.
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: No service file found in /etc/avahi/services.
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Failed to create server: No suitable network protocol available
      Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: avahi-daemon 0.7 exiting.
      Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Main process exited, code=exited, status=255/n/a
      Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Failed with result 'exit-code'.
      Jun 12 09:49:24 ubuntu-vm systemd[1]: Failed to start Avahi mDNS/DNS-SD Stack.
      
      Fixes: f396922d ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 05 Jun, 2018 1 commit
    • net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets · f396922d
      Maciej Żenczykowski committed
      It is not safe to do so because such sockets are already in the
      hash tables and changing these options can result in invalidating
      the tb->fastreuse(port) caching.
      
      This can have later far reaching consequences wrt. bind conflict checks
      which rely on these caches (for optimization purposes).
      
      Not to mention that you can currently end up with two identical
      non-reuseport listening sockets bound to the same local ip:port
      by clearing reuseport on them after they've already both been bound.
      
      There is unfortunately no EISBOUND error or anything similar,
      and EISCONN seems to be misleading for a bound-but-not-connected
      socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
      is the closest you can get to meaning 'socket in bad state'.
      (although perhaps EINVAL wouldn't be a bad choice either?)
      
      This does unfortunately run the risk of breaking buggy
      userspace programs...
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 26 May, 2018 1 commit
  14. 19 May, 2018 1 commit
    • sock_diag: fix use-after-free read in __sk_free · 9709020c
      Eric Dumazet committed
      We must not call sock_diag_has_destroy_listeners(sk) on a socket
      that has no reference on net structure.
      
      BUG: KASAN: use-after-free in sock_diag_has_destroy_listeners include/linux/sock_diag.h:75 [inline]
      BUG: KASAN: use-after-free in __sk_free+0x329/0x340 net/core/sock.c:1609
      Read of size 8 at addr ffff88018a02e3a0 by task swapper/1/0
      
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.17.0-rc5+ #54
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1b9/0x294 lib/dump_stack.c:113
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
       sock_diag_has_destroy_listeners include/linux/sock_diag.h:75 [inline]
       __sk_free+0x329/0x340 net/core/sock.c:1609
       sk_free+0x42/0x50 net/core/sock.c:1623
       sock_put include/net/sock.h:1664 [inline]
       reqsk_free include/net/request_sock.h:116 [inline]
       reqsk_put include/net/request_sock.h:124 [inline]
       inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:672 [inline]
       reqsk_timer_handler+0xe27/0x10e0 net/ipv4/inet_connection_sock.c:739
       call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1d1/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:525 [inline]
       smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
       </IRQ>
      RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
      RSP: 0018:ffff8801d9ae7c38 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
      RAX: dffffc0000000000 RBX: 1ffff1003b35cf8a RCX: 0000000000000000
      RDX: 1ffffffff11a30d0 RSI: 0000000000000001 RDI: ffffffff88d18680
      RBP: ffff8801d9ae7c38 R08: ffffed003b5e46c3 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffff8801d9ae7cf0 R14: ffffffff897bef20 R15: 0000000000000000
       arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
       default_idle+0xc2/0x440 arch/x86/kernel/process.c:354
       arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:345
       default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
       cpuidle_idle_call kernel/sched/idle.c:153 [inline]
       do_idle+0x395/0x560 kernel/sched/idle.c:262
       cpu_startup_entry+0x104/0x120 kernel/sched/idle.c:368
       start_secondary+0x426/0x5b0 arch/x86/kernel/smpboot.c:269
       secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
      
      Allocated by task 4557:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
       kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
       kmem_cache_zalloc include/linux/slab.h:691 [inline]
       net_alloc net/core/net_namespace.c:383 [inline]
       copy_net_ns+0x159/0x4c0 net/core/net_namespace.c:423
       create_new_namespaces+0x69d/0x8f0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc3/0x1f0 kernel/nsproxy.c:206
       ksys_unshare+0x708/0xf90 kernel/fork.c:2408
       __do_sys_unshare kernel/fork.c:2476 [inline]
       __se_sys_unshare kernel/fork.c:2474 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2474
       do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 69:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
       net_free net/core/net_namespace.c:399 [inline]
       net_drop_ns.part.14+0x11a/0x130 net/core/net_namespace.c:406
       net_drop_ns net/core/net_namespace.c:405 [inline]
       cleanup_net+0x6a1/0xb20 net/core/net_namespace.c:541
       process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
       worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
       kthread+0x345/0x410 kernel/kthread.c:240
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      The buggy address belongs to the object at ffff88018a02c140
       which belongs to the cache net_namespace of size 8832
      The buggy address is located 8800 bytes inside of
       8832-byte region [ffff88018a02c140, ffff88018a02e3c0)
      The buggy address belongs to the page:
      page:ffffea0006280b00 count:1 mapcount:0 mapping:ffff88018a02c140 index:0x0 compound_mapcount: 0
      flags: 0x2fffc0000008100(slab|head)
      raw: 02fffc0000008100 ffff88018a02c140 0000000000000000 0000000100000001
      raw: ffffea00062a1320 ffffea0006268020 ffff8801d9bdde40 0000000000000000
      page dumped because: kasan: bad access detected
      
      Fixes: b922622e ("sock_diag: don't broadcast kernel sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Craig Gallek <kraig@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9709020c
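The rule stated in the commit above (do not touch the net structure for a socket that holds no reference on it) can be illustrated with short-circuit evaluation. The sketch below is a user-space model under stated assumptions: `model_sock`, `model_net`, and both functions are hypothetical, and the counter exists only to make the dereference observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-namespace state. */
struct model_net {
    bool has_destroy_listeners;
};

/* Hypothetical stand-in for a socket. */
struct model_sock {
    bool net_refcnt;       /* true if the socket holds a reference on net */
    struct model_net *net; /* may be stale or NULL when net_refcnt is false */
};

/* Counts how often the net structure is actually dereferenced. */
int model_net_derefs;

static bool model_has_destroy_listeners(const struct model_sock *sk)
{
    model_net_derefs++;
    return sk->net->has_destroy_listeners;
}

/* Checks the reference flag first, so the net structure is only
 * read when the socket is known to keep it alive. */
bool model_should_broadcast_destroy(const struct model_sock *sk)
{
    assert(sk != NULL);
    return sk->net_refcnt && model_has_destroy_listeners(sk);
}
```

Because `&&` evaluates left to right and stops at the first false operand, a socket without a net reference never reaches the dereference in this model.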
  15. 16 May, 2018 1 commit
  16. 11 May, 2018 1 commit
  17. 04 May, 2018 1 commit
  18. 17 Apr, 2018 1 commit
    • E
      tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet committed
      Applications might use SO_RCVLOWAT on TCP socket hoping to receive
      one [E]POLLIN event only when a given amount of bytes are ready in socket
      receive queue.
      
      Problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1361840
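The idea of a per-protocol set_rcvlowat hook from the commit above can be sketched as a function pointer that overrides a generic default. This is a simplified user-space model, not the kernel implementation: all `model_*` names are hypothetical, and growing the receive buffer to twice the threshold (capped at a maximum) is an assumption made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct model_sock;

/* Hypothetical per-protocol operations table with an optional hook. */
struct model_proto_ops {
    int (*set_rcvlowat)(struct model_sock *sk, int val); /* NULL: default */
};

struct model_sock {
    const struct model_proto_ops *ops;
    int rcvlowat;   /* bytes required before the socket reports readable */
    int rcvbuf;     /* current receive buffer size */
    int rcvbuf_max; /* upper bound for automatic growth */
};

/* TCP-like override: make sure the buffer can hold the threshold. */
static int model_tcp_set_rcvlowat(struct model_sock *sk, int val)
{
    sk->rcvlowat = val ? val : 1;

    /* Assumption: ask for twice the threshold, guarding against overflow. */
    int want = (val > INT_MAX / 2) ? INT_MAX : val * 2;
    if (want > sk->rcvbuf_max)
        want = sk->rcvbuf_max;
    if (want > sk->rcvbuf)
        sk->rcvbuf = want;
    return 0;
}

const struct model_proto_ops model_tcp_ops = {
    .set_rcvlowat = model_tcp_set_rcvlowat,
};
const struct model_proto_ops model_plain_ops = {
    .set_rcvlowat = NULL,
};

/* Generic entry point: use the protocol hook if there is one. */
int model_setsockopt_rcvlowat(struct model_sock *sk, int val)
{
    assert(sk != NULL && sk->ops != NULL);
    if (val < 0)
        val = INT_MAX;
    if (sk->ops->set_rcvlowat)
        return sk->ops->set_rcvlowat(sk, val);
    sk->rcvlowat = val ? val : 1; /* default: only record the threshold */
    return 0;
}
```

In this model a protocol without the hook only stores the threshold, while the TCP-like protocol also enlarges its buffer so that the requested number of bytes can actually be queued.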
  19. 28 Mar, 2018 1 commit
  20. 27 Mar, 2018 1 commit
  21. 20 Mar, 2018 2 commits
  22. 15 Mar, 2018 1 commit
  23. 12 Mar, 2018 1 commit
    • X
      sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Xin Long committed
      Currently, running 'ss' from iproute makes the kernel try to load all
      _diag modules, which in turn loads the corresponding family and proto
      modules because of module dependencies.
      
      For instance, after running 'ss', sctp, dccp and af_packet (if built as
      a module) would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users who do not use SCTP usually expect SCTP INIT packets to be
        dropped silently rather than answered with an ABORT.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch introduces sock_load_diag_module(), which loads
      the _diag module only when its corresponding family or proto has
      already been registered.
      
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compiling err.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Phil Sutter <phil@nwl.cc>
      Acked-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf2ae2e4
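The gating described in the commit above (request a _diag module only if its family or proto is already registered) can be modeled as a lookup that runs before the module request. The sketch is user-space code under stated assumptions: the registry arrays, `model_request_module()`, and `model_load_diag_module()` are hypothetical, treating protocol 0 as "family-level request" and returning -ENOENT for unregistered entries are illustrative choices.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NFAMILIES 8
#define MODEL_NPROTOS   8

/* Hypothetical registries: true once a family or proto has registered. */
bool model_family_registered[MODEL_NFAMILIES];
bool model_proto_registered[MODEL_NPROTOS];

/* Counts module load requests so the gating is observable. */
int model_module_requests;

static int model_request_module(void)
{
    model_module_requests++;
    return 0;
}

/* Requests the diag module only when the underlying family or proto
 * is already present, so that 'ss'-style queries load nothing new. */
int model_load_diag_module(int family, int protocol)
{
    assert(family >= 0 && family < MODEL_NFAMILIES);
    assert(protocol >= 0 && protocol < MODEL_NPROTOS);

    if (protocol == 0) {
        /* Assumption: protocol 0 means a family-level diag request. */
        if (!model_family_registered[family])
            return -ENOENT;
        return model_request_module();
    }

    if (!model_proto_registered[protocol])
        return -ENOENT;
    return model_request_module();
}
```

A query for an unregistered protocol returns an error without any load request, which is the property that keeps unrelated modules from being pulled in.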
  24. 08 Mar, 2018 1 commit
  25. 22 Feb, 2018 1 commit
    • E
      tcp: switch to GSO being always on · 0a6b2a1d
      Eric Dumazet committed
      Oleksandr Natalenko reported performance issues with BBR without FQ
      packet scheduler that were root caused to lack of SG and GSO/TSO on
      his configuration.
      
      In this mode, TCP internal pacing has to set up a high resolution timer
      for each MSS sent.
      
      We could implement in TCP a strategy similar to the one adopted
      in commit fefa569a ("net_sched: sch_fq: account for schedule/timers drifts")
      or decide to finally switch TCP stack to a GSO only mode.
      
      This has many benefits:
      
      1) Most TCP developments are done with TSO in mind.
      2) Fewer high-resolution timers need to be armed for TCP pacing.
      3) GSO can benefit from the xmit_more hint.
      4) Receiver GRO is more effective (as if TSO was used for real on sender)
         -> Lower ACK traffic
      5) Write queues have less overhead (one skb holds about 64KB of payload).
      6) SACK coalescing just works.
      7) rtx rb-tree contains fewer packets, SACK is cheaper.
      
      This patch implements the minimum patch, but we can remove some legacy
      code as follow ups.
      
      Tested:
      
      On 40Gbit link, one netperf -t TCP_STREAM
      
      BBR+fq:
      sg on:  26 Gbits/sec
      sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)
      
      BBR+pfifo_fast:
      sg on:  24.2 Gbits/sec
      sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )
      
      BBR+fq_codel:
      sg on:  24.4 Gbits/sec
      sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a6b2a1d
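The timer argument in the commit above can be made concrete with a small calculation: if internal pacing arms one timer per unit sent, the number of timers is the payload divided by the unit size, rounded up. The helper below is a hypothetical sketch; the one-timer-per-unit model and the example MSS of 1448 bytes are assumptions, while the 64KB figure comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pacing timers needed to send `bytes` when one timer is
 * armed per `unit` bytes (ceiling division). Simplified model. */
uint64_t model_pacing_timers(uint64_t bytes, uint64_t unit)
{
    assert(unit > 0);
    return (bytes + unit - 1) / unit;
}
```

Under these assumptions, sending 1 MiB (1048576 bytes) takes 725 timers at 1448 bytes per unit but only 16 at 65536 bytes per unit.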
  26. 17 Feb, 2018 1 commit
  27. 13 Feb, 2018 1 commit