1. 03 4月, 2019 12 次提交
    • E
      tcp: do not use ipv6 header for ipv4 flow · 7115df61
      Eric Dumazet 提交于
      [ Upstream commit 89e4130939a20304f4059ab72179da81f5347528 ]
      
      When a dual stack tcp listener accepts an ipv4 flow,
      it should not attempt to use an ipv6 header or tcp_v6_iif() helper.
      
      Fixes: 1397ed35 ("ipv6: add flowinfo for tcp6 pkt_options for all cases")
      Fixes: df3687ff ("ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7115df61
    • X
      sctp: use memdup_user instead of vmemdup_user · cab576f1
      Xin Long 提交于
      [ Upstream commit ef82bcfa671b9a635bab5fa669005663d8b177c5 ]
      
      In sctp_setsockopt_bindx()/__sctp_setsockopt_connectx(), it allocates
      memory with addrs_size which is passed from userspace. We used flag
      GFP_USER to put some more restrictions on it in Commit cacc0621
      ("sctp: use GFP_USER for user-controlled kmalloc").
      
      However, since Commit c981f254 ("sctp: use vmemdup_user() rather
      than badly open-coding memdup_user()"), vmemdup_user() has been used,
      which doesn't check GFP_USER flag when goes to vmalloc_*(). So when
      addrs_size is a huge value, it could exhaust memory and even trigger
      oom killer.
      
      This patch is to use memdup_user() instead, in which GFP_USER would
      work to limit the memory allocation with a huge addrs_size.
      
      Note we can't fix it by limiting 'addrs_size', as there's no demand
      for it from RFC.
      
      Reported-by: syzbot+ec1b7575afef85a0e5ca@syzkaller.appspotmail.com
      Fixes: c981f254 ("sctp: use vmemdup_user() rather than badly open-coding memdup_user()")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cab576f1
    • M
      packets: Always register packet sk in the same order · 69cea7cf
      Maxime Chevallier 提交于
      [ Upstream commit a4dc6a49156b1f8d6e17251ffda17c9e6a5db78a ]
      
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69cea7cf
    • Y
      net-sysfs: call dev_hold if kobject_init_and_add success · d9d215be
      YueHaibing 提交于
      [ Upstream commit a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e ]
      
      In netdev_queue_add_kobject and rx_queue_add_kobject,
      if sysfs_create_group failed, kobject_put will call
      netdev_queue_release to decrease dev refcont, however
      dev_hold has not be called. So we will see this while
      unregistering dev:
      
      unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: d0d66837 ("net: don't decrement kobj reference count on init failure")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d215be
    • E
      net: rose: fix a possible stack overflow · 7eeb12ed
      Eric Dumazet 提交于
      [ Upstream commit e5dcc0c3223c45c94100f05f28d8ef814db3d82c ]
      
      rose_write_internal() uses a temp buffer of 100 bytes, but a manual
      inspection showed that given arbitrary input, rose_create_facilities()
      can fill up to 110 bytes.
      
      Lets use a tailroom of 256 bytes for peace of mind, and remove
      the bounce buffer : we can simply allocate a big enough skb
      and adjust its length as needed.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_create_facilities net/rose/rose_subr.c:521 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
      Write of size 7 at addr ffff88808b1ffbef by task syz-executor.0/24854
      
      CPU: 0 PID: 24854 Comm: syz-executor.0 Not tainted 5.0.0+ #97
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:131
       memcpy include/linux/string.h:352 [inline]
       rose_create_facilities net/rose/rose_subr.c:521 [inline]
       rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
       rose_connect+0x7cb/0x1510 net/rose/af_rose.c:826
       __sys_connect+0x266/0x330 net/socket.c:1685
       __do_sys_connect net/socket.c:1696 [inline]
       __se_sys_connect net/socket.c:1693 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1693
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458079
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f47b8d9dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458079
      RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000004
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f47b8d9e6d4
      R13: 00000000004be4a4 R14: 00000000004ceca8 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00022c7fc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff022c0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b1ffa80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffb00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 03
      >ffff88808b1ffb80: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 04 f3
                                                                   ^
       ffff88808b1ffc00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffc80: 00 00 00 00 00 00 00 f1 f1 f1 f1 f1 f1 01 f2 01
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7eeb12ed
    • C
      net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec · 85ef72d8
      Christoph Paasch 提交于
      [ Upstream commit 398f0132c14754fcd03c1c4f8e7176d001ce8ea1 ]
      
      Since commit fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
      one can now allocate packet ring buffers >= UINT_MAX. However, syzkaller
      found that that triggers a warning:
      
      [   21.100000] WARNING: CPU: 2 PID: 2075 at mm/page_alloc.c:4584 __alloc_pages_nod0
      [   21.101490] Modules linked in:
      [   21.101921] CPU: 2 PID: 2075 Comm: syz-executor.0 Not tainted 5.0.0 #146
      [   21.102784] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
      [   21.103887] RIP: 0010:__alloc_pages_nodemask+0x2a0/0x630
      [   21.104640] Code: fe ff ff 65 48 8b 04 25 c0 de 01 00 48 05 90 0f 00 00 41 bd 01 00 00 00 48 89 44 24 48 e9 9c fe 3
      [   21.107121] RSP: 0018:ffff88805e1cf920 EFLAGS: 00010246
      [   21.107819] RAX: 0000000000000000 RBX: ffffffff85a488a0 RCX: 0000000000000000
      [   21.108753] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000000
      [   21.109699] RBP: 1ffff1100bc39f28 R08: ffffed100bcefb67 R09: ffffed100bcefb67
      [   21.110646] R10: 0000000000000001 R11: ffffed100bcefb66 R12: 000000000000000d
      [   21.111623] R13: 0000000000000000 R14: ffff88805e77d888 R15: 000000000000000d
      [   21.112552] FS:  00007f7c7de05700(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
      [   21.113612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   21.114405] CR2: 000000000065c000 CR3: 000000005e58e006 CR4: 00000000001606e0
      [   21.115367] Call Trace:
      [   21.115705]  ? __alloc_pages_slowpath+0x21c0/0x21c0
      [   21.116362]  alloc_pages_current+0xac/0x1e0
      [   21.116923]  kmalloc_order+0x18/0x70
      [   21.117393]  kmalloc_order_trace+0x18/0x110
      [   21.117949]  packet_set_ring+0x9d5/0x1770
      [   21.118524]  ? packet_rcv_spkt+0x440/0x440
      [   21.119094]  ? lock_downgrade+0x620/0x620
      [   21.119646]  ? __might_fault+0x177/0x1b0
      [   21.120177]  packet_setsockopt+0x981/0x2940
      [   21.120753]  ? __fget+0x2fb/0x4b0
      [   21.121209]  ? packet_release+0xab0/0xab0
      [   21.121740]  ? sock_has_perm+0x1cd/0x260
      [   21.122297]  ? selinux_secmark_relabel_packet+0xd0/0xd0
      [   21.123013]  ? __fget+0x324/0x4b0
      [   21.123451]  ? selinux_netlbl_socket_setsockopt+0x101/0x320
      [   21.124186]  ? selinux_netlbl_sock_rcv_skb+0x3a0/0x3a0
      [   21.124908]  ? __lock_acquire+0x529/0x3200
      [   21.125453]  ? selinux_socket_setsockopt+0x5d/0x70
      [   21.126075]  ? __sys_setsockopt+0x131/0x210
      [   21.126533]  ? packet_release+0xab0/0xab0
      [   21.127004]  __sys_setsockopt+0x131/0x210
      [   21.127449]  ? kernel_accept+0x2f0/0x2f0
      [   21.127911]  ? ret_from_fork+0x8/0x50
      [   21.128313]  ? do_raw_spin_lock+0x11b/0x280
      [   21.128800]  __x64_sys_setsockopt+0xba/0x150
      [   21.129271]  ? lockdep_hardirqs_on+0x37f/0x560
      [   21.129769]  do_syscall_64+0x9f/0x450
      [   21.130182]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      We should allocate with __GFP_NOWARN to handle this.
      
      Cc: Kal Conley <kal.conley@dectris.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Fixes: fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85ef72d8
    • P
      net: datagram: fix unbounded loop in __skb_try_recv_datagram() · 88c64f9c
      Paolo Abeni 提交于
      [ Upstream commit 0b91bce1ebfc797ff3de60c8f4a1e6219a8a3187 ]
      
      Christoph reported a stall while peeking datagram with an offset when
      busy polling is enabled. __skb_try_recv_datagram() uses as the loop
      termination condition 'queue empty'. When peeking, the socket
      queue can be not empty, even when no additional packets are received.
      
      Address the issue explicitly checking for receive queue changes,
      as currently done by __skb_wait_for_more_packets().
      
      Fixes: 2b5cd0df ("net: Change return type of sk_busy_loop from bool to void")
      Reported-and-tested-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88c64f9c
    • X
      ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL · be092113
      Xin Long 提交于
      [ Upstream commit 1c87e79a002f6a159396138cd3f3ab554a2a8887 ]
      
      Jianlin reported a crash:
      
        [  381.484332] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
        [  381.619802] RIP: 0010:fib6_rule_lookup+0xa3/0x160
        [  382.009615] Call Trace:
        [  382.020762]  <IRQ>
        [  382.030174]  ip6_route_redirect.isra.52+0xc9/0xf0
        [  382.050984]  ip6_redirect+0xb6/0xf0
        [  382.066731]  icmpv6_notify+0xca/0x190
        [  382.083185]  ndisc_redirect_rcv+0x10f/0x160
        [  382.102569]  ndisc_rcv+0xfb/0x100
        [  382.117725]  icmpv6_rcv+0x3f2/0x520
        [  382.133637]  ip6_input_finish+0xbf/0x460
        [  382.151634]  ip6_input+0x3b/0xb0
        [  382.166097]  ipv6_rcv+0x378/0x4e0
      
      It was caused by the lookup function __ip6_route_redirect() returns NULL in
      fib6_rule_lookup() when ip6_create_rt_rcu() returns NULL.
      
      So we fix it by simply making ip6_create_rt_rcu() return ip6_null_entry
      instead of NULL.
      
      v1->v2:
        - move down 'fallback:' to make it more readable.
      
      Fixes: e873e4b9 ("ipv6: use fib6_info_hold_safe() when necessary")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Suggested-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be092113
    • Y
      genetlink: Fix a memory leak on error path · 9b8ef421
      YueHaibing 提交于
      [ Upstream commit ceabee6c59943bdd5e1da1a6a20dc7ee5f8113a2 ]
      
      In genl_register_family(), when idr_alloc() fails,
      we forget to free the memory we possibly allocate for
      family->attrbuf.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: 2ae0f17d ("genetlink: use idr to track families")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9b8ef421
    • E
      dccp: do not use ipv6 header for ipv4 flow · 321461f2
      Eric Dumazet 提交于
      [ Upstream commit e0aa67709f89d08c8d8e5bdd9e0b649df61d0090 ]
      
      When a dual stack dccp listener accepts an ipv4 flow,
      it should not attempt to use an ipv6 header or
      inet6_iif() helper.
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      321461f2
    • M
      Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer · 15d6538a
      Marcel Holtmann 提交于
      commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69 upstream.
      
      The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
      as length value. The opt->len however is in control over the remote user
      and can be used by an attacker to gain access beyond the bounds of the
      actual packet.
      
      To prevent any potential leak of heap memory, it is enough to check that
      the resulting len calculation after calling l2cap_get_conf_opt is not
      below zero. A well formed packet will always return >= 0 here and will
      end with the length value being zero after the last option has been
      parsed. In case of malformed packets messing with the opt->len field the
      length value will become negative. If that is the case, then just abort
      and ignore the option.
      
      In case an attacker uses a too short opt->len value, then garbage will
      be parsed, but that is protected by the unknown option handling and also
      the option parameter size checks.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      15d6538a
    • M
      Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt · 2318c0e4
      Marcel Holtmann 提交于
      commit af3d5d1c87664a4f150fcf3534c6567cb19909b0 upstream.
      
      When doing option parsing for standard type values of 1, 2 or 4 octets,
      the value is converted directly into a variable instead of a pointer. To
      avoid being tricked into being a pointer, check that for these option
      types that sizes actually match. In L2CAP every option is fixed size and
      thus it is prudent anyway to ensure that the remote side sends us the
      right option size along with option paramters.
      
      If the option size is not matching the option type, then that option is
      silently ignored. It is a protocol violation and instead of trying to
      give the remote attacker any further hints just pretend that option is
      not present and proceed with the default values. Implementation
      following the specification and its qualification procedures will always
      use the correct size and thus not being impacted here.
      
      To keep the code readable and consistent accross all options, a few
      cosmetic changes were also required.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2318c0e4
  2. 27 3月, 2019 3 次提交
  3. 24 3月, 2019 14 次提交
    • J
      svcrpc: fix UDP on servers with lots of threads · 43bceddc
      J. Bruce Fields 提交于
      commit b7e5034cbecf5a65b7bfdc2b20a8378039577706 upstream.
      
      James Pearson found that an NFS server stopped responding to UDP
      requests if started with more than 1017 threads.
      
      sv_max_mesg is about 2^20, so that is probably where the calculation
      performed by
      
      	svc_sock_setbufsize(svsk->sk_sock,
                                  (serv->sv_nrthreads+3) * serv->sv_max_mesg,
                                  (serv->sv_nrthreads+3) * serv->sv_max_mesg);
      
      starts to overflow an int.
      Reported-by: NJames Pearson <jcpearson@gmail.com>
      Tested-by: NJames Pearson <jcpearson@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43bceddc
    • W
      bpf: only test gso type on gso packets · 9920eb40
      Willem de Bruijn 提交于
      [ Upstream commit 4c3024debf62de4c6ac6d3cb4c0063be21d4f652 ]
      
      BPF can adjust gso only for tcp bytestreams. Fail on other gso types.
      
      But only on gso packets. It does not touch this field if !gso_size.
      
      Fixes: b90efd225874 ("bpf: only adjust gso_size on bytestream protocols")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9920eb40
    • A
      netfilter: ipt_CLUSTERIP: fix warning unused variable cn · 0d98ecb1
      Anders Roxell 提交于
      commit 206b8cc514d7ff2b79dd2d5ad939adc7c493f07a upstream.
      
      When CONFIG_PROC_FS isn't set the variable cn isn't used.
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
      net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
        struct clusterip_net *cn = clusterip_pernet(net);
                              ^~
      
      Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS".
      
      Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine")
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d98ecb1
    • A
      phonet: fix building with clang · ee01ac61
      Arnd Bergmann 提交于
      [ Upstream commit 6321aa197547da397753757bd84c6ce64b3e3d89 ]
      
      clang warns about overflowing the data[] member in the struct pnpipehdr:
      
      net/phonet/pep.c:295:8: warning: array index 4 is past the end of the array (which contains 1 element) [-Warray-bounds]
                              if (hdr->data[4] == PEP_IND_READY)
                                  ^         ~
      include/net/phonet/pep.h:66:3: note: array 'data' declared here
                      u8              data[1];
      
      Using a flexible array member at the end of the struct avoids the
      warning, but since we cannot have a flexible array member inside
      of the union, each index now has to be moved back by one, which
      makes it a little uglier.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NRémi Denis-Courmont <remi@remlab.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ee01ac61
    • T
      xfrm: Fix inbound traffic via XFRM interfaces across network namespaces · 6ac400b7
      Tobias Brunner 提交于
      [ Upstream commit 660899ddf06ae8bb5bbbd0a19418b739375430c5 ]
      
      After moving an XFRM interface to another namespace it stays associated
      with the original namespace (net in `struct xfrm_if` and the list keyed
      with `xfrmi_net_id`), allowing processes in the new namespace to use
      SAs/policies that were created in the original namespace.  For instance,
      this allows a keying daemon in one namespace to establish IPsec SAs for
      other namespaces without processes there having access to the keys or IKE
      credentials.
      
      This worked fine for outbound traffic, however, for inbound traffic the
      lookup for the interfaces and the policies used the incorrect namespace
      (the one the XFRM interface was moved to).
      
      Fixes: f203b76d ("xfrm: Add virtual xfrm interfaces")
      Signed-off-by: NTobias Brunner <tobias@strongswan.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6ac400b7
    • S
      af_key: unconditionally clone on broadcast · 978e0388
      Sean Tranchetti 提交于
      [ Upstream commit fc2d5cfdcfe2ab76b263d91429caa22451123085 ]
      
      Attempting to avoid cloning the skb when broadcasting by inflating
      the refcount with sock_hold/sock_put while under RCU lock is dangerous
      and violates RCU principles. It leads to subtle race conditions when
      attempting to free the SKB, as we may reference sockets that have
      already been freed by the stack.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
      [006b6b6b6b6b6c4b] address between user and kernel address ranges
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      task: fffffff78f65b380 task.stack: ffffff8049a88000
      pc : sock_rfree+0x38/0x6c
      lr : skb_release_head_state+0x6c/0xcc
      Process repro (pid: 7117, stack limit = 0xffffff8049a88000)
      Call trace:
      	sock_rfree+0x38/0x6c
      	skb_release_head_state+0x6c/0xcc
      	skb_release_all+0x1c/0x38
      	__kfree_skb+0x1c/0x30
      	kfree_skb+0xd0/0xf4
      	pfkey_broadcast+0x14c/0x18c
      	pfkey_sendmsg+0x1d8/0x408
      	sock_sendmsg+0x44/0x60
      	___sys_sendmsg+0x1d0/0x2a8
      	__sys_sendmsg+0x64/0xb4
      	SyS_sendmsg+0x34/0x4c
      	el0_svc_naked+0x34/0x38
      Kernel panic - not syncing: Fatal exception
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      978e0388
    • W
      bpf: only adjust gso_size on bytestream protocols · 413e3985
      Willem de Bruijn 提交于
      [ Upstream commit b90efd2258749e04e1b3f71ef0d716f2ac2337e0 ]
      
      bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
      For GSO packets they adjust gso_size to maintain the same MTU.
      
      The gso size can only be safely adjusted on bytestream protocols.
      Commit d02f51cb ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
      to deal with gso sctp skbs") excluded SKB_GSO_SCTP.
      
      Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
      gso_size unit per datagram. Also exclude these.
      
      Move from a blacklist to a whitelist check to future proof against
      additional such new GSO types, e.g., for fraglist based GRO.
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      413e3985
    • M
      esp: Skip TX bytes accounting when sending from a request socket · b92eaed3
      Martin Willi 提交于
      [ Upstream commit 09db51241118aeb06e1c8cd393b45879ce099b36 ]
      
      On ESP output, sk_wmem_alloc is incremented for the added padding if a
      socket is associated to the skb. When replying with TCP SYNACKs over
      IPsec, the associated sk is a casted request socket, only. Increasing
      sk_wmem_alloc on a request socket results in a write at an arbitrary
      struct offset. In the best case, this produces the following WARNING:
      
      WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
      refcount_t: addition on 0; use-after-free.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
      Hardware name: Marvell Armada 380/385 (Device Tree)
      [...]
      [<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
      [<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
      [<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
      [<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
      [<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
      [<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
      [<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
      [<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
      [<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
      [<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
      [<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
      [<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
      [<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
      [...]
      
      The issue triggers only when not using TCP syncookies, as for syncookies
      no socket is associated.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: NMartin Willi <martin@strongswan.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b92eaed3
    • N
      xprtrdma: Make sure Send CQ is allocated on an existing compvec · d84bc704
      Nicolas Morey-Chaisemartin 提交于
      [ Upstream commit a4cb5bdb754afe21f3e9e7164213e8600cf69427 ]
      
      Make sure the device has at least 2 completion vectors
      before allocating to compvec#1
      
      Fixes: a4699f56 (xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode)
      Signed-off-by: NNicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d84bc704
    • A
      ipvs: fix dependency on nf_defrag_ipv6 · 5ca2ef67
      Andrea Claudi 提交于
      [ Upstream commit 098e13f5b21d3398065fce8780f07a3ef62f4812 ]
      
      ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
      but lacks proper Kconfig dependencies and does not explicitly
      request defrag features.
      
      As a result, if netfilter hooks are not loaded, when IPv6 fragmented
      packet are handled by ipvs only the first fragment makes through.
      
      Fix it properly declaring the dependency on Kconfig and registering
      netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NAndrea Claudi <aclaudi@redhat.com>
      Acked-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5ca2ef67
    • F
      netfilter: compat: initialize all fields in xt_init · e0e6b0d7
      Francesco Ruggeri 提交于
      [ Upstream commit 8d29d16d21342a0c86405d46de0c4ac5daf1760f ]
      
      If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
      time, the following panic can be caused by running
      
      % ebtables -t broute -F BROUTING
      
      from a 32-bit user level on a 64-bit kernel. This patch replaces
      kmalloc_array with kcalloc when allocating xt.
      
      [  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
      [  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
      [  474.693838] Oops: 0000 [#1] SMP
      [  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
      [  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
      [  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
      [  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
      [  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
      [  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
      [  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
      [  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
      [  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
      [  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
      [  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
      [  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      [  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
      [  474.802052] Call Trace:
      [  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
      [  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
      [  474.815605]  ? try_module_get+0x37/0x42
      [  474.819716]  compat_nf_setsockopt+0x4f/0x6d
      [  474.824172]  compat_ip_setsockopt+0x7e/0x8c
      [  474.828641]  compat_raw_setsockopt+0x16/0x3a
      [  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
      [  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
      [  474.843343]  ? __check_object_size+0x76/0x19a
      [  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
      [  474.853276]  do_fast_syscall_32+0xaf/0xf6
      [  474.857548]  entry_SYSENTER_compat+0x6b/0x7a
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e0e6b0d7
    • I
      mac80211: Fix Tx aggregation session tear down with ITXQs · a5a24445
      Ilan Peer 提交于
      [ Upstream commit 6157ca0d6bfe437691b1e98a62e2efe12b6714da ]
      
      When mac80211 requests the low level driver to stop an ongoing
      Tx aggregation, the low level driver is expected to call
      ieee80211_stop_tx_ba_cb_irqsafe() to indicate that it is ready
      to stop the session. The callback in turn schedules a worker
      to complete the session tear down, which in turn also handles
      the relevant state for the intermediate Tx queue.
      
      However, as this flow in asynchronous, the intermediate queue
      should be stopped and not continue servicing frames, as in
      such a case frames that are dequeued would be marked as part
      of an aggregation, although the aggregation is already been
      stopped.
      
      Fix this by stopping the intermediate Tx queue, before
      calling the low level driver to stop the Tx aggregation.
      Signed-off-by: NIlan Peer <ilan.peer@intel.com>
      Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a5a24445
    • J
      mac80211: call drv_ibss_join() on restart · bff33ba4
      Johannes Berg 提交于
      [ Upstream commit 4926b51bfaa6d36bd6f398fb7698679d3962e19d ]
      
      If a driver does any significant activity in its ibss_join method,
      then it will very well expect that to be called during restart,
      before any stations are added. Do that.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bff33ba4
    • Z
      9p/net: fix memory leak in p9_client_create · 85bdc9da
      zhengbin 提交于
      commit bb06c388fa20ae24cfe80c52488de718a7e3a53f upstream.
      
      If msize is less than 4096, we should close and put trans, destroy
      tagpool, not just free client. This patch fixes that.
      
      Link: http://lkml.kernel.org/m/1552464097-142659-1-git-send-email-zhengbin13@huawei.com
      Cc: stable@vger.kernel.org
      Fixes: 574d356b7a02 ("9p/net: put a lower bound on msize")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: Nzhengbin <zhengbin13@huawei.com>
      Signed-off-by: NDominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85bdc9da
  4. 19 3月, 2019 11 次提交
    • V
      net: sched: flower: insert new filter to idr after setting its mask · 275a2c08
      Vlad Buslov 提交于
      [ Upstream commit ecb3dea400d3beaf611ce76ac7a51d4230492cf2 ]
      
      When adding new filter to flower classifier, fl_change() inserts it to
      handle_idr before initializing filter extensions and assigning it a mask.
      Normally this ordering doesn't matter because all flower classifier ops
      callbacks assume rtnl lock protection. However, when filter has an action
      that doesn't have its kernel module loaded, rtnl lock is released before
      call to request_module(). During this time the filter can be accessed bu
      concurrent task before its initialization is completed, which can lead to a
      crash.
      
      Example case of NULL pointer dereference in concurrent dump:
      
      Task 1                           Task 2
      
      tc_new_tfilter()
       fl_change()
        idr_alloc_u32(fnew)
        fl_set_parms()
         tcf_exts_validate()
          tcf_action_init()
           tcf_action_init_1()
            rtnl_unlock()
            request_module()
            ...                        rtnl_lock()
            				 tc_dump_tfilter()
            				  tcf_chain_dump()
      				   fl_walk()
      				    idr_get_next_ul()
      				    tcf_node_dump()
      				     tcf_fill_node()
      				      fl_dump()
      				       mask = &f->mask->key; <- NULL ptr
            rtnl_lock()
      
      Extension initialization and mask assignment don't depend on fnew->handle
      that is allocated by idr_alloc_u32(). Move idr allocation code after action
      creation and mask assignment in fl_change() to prevent concurrent access
      to not fully initialized filter when rtnl lock is released to load action
      module.
      
      Fixes: 01683a14 ("net: sched: refactor flower walk to iterate over idr")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Reviewed-by: NRoi Dayan <roid@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      275a2c08
    • A
      missing barriers in some of unix_sock ->addr and ->path accesses · 345af5ab
      Al Viro 提交于
      [ Upstream commit ae3b564179bfd06f32d051b9e5d72ce4b2a07c37 ]
      
      Several u->addr and u->path users are not holding any locks in
      common with unix_bind().  unix_state_lock() is useless for those
      purposes.
      
      u->addr is assign-once and *(u->addr) is fully set up by the time
      we set u->addr (all under unix_table_lock).  u->path is also
      set in the same critical area, also before setting u->addr, and
      any unix_sock with ->path filled will have non-NULL ->addr.
      
      So setting ->addr with smp_store_release() is all we need for those
      "lockless" users - just have them fetch ->addr with smp_load_acquire()
      and don't even bother looking at ->path if they see NULL ->addr.
      
      Users of ->addr and ->path fall into several classes now:
          1) ones that do smp_load_acquire(u->addr) and access *(u->addr)
      and u->path only if smp_load_acquire() has returned non-NULL.
          2) places holding unix_table_lock.  These are guaranteed that
      *(u->addr) is seen fully initialized.  If unix_sock is in one of the
      "bound" chains, so's ->path.
          3) unix_sock_destructor() using ->addr is safe.  All places
      that set u->addr are guaranteed to have seen all stores *(u->addr)
      while holding a reference to u and unix_sock_destructor() is called
      when (atomic) refcount hits zero.
          4) unix_release_sock() using ->path is safe.  unix_bind()
      is serialized wrt unix_release() (normally - by struct file
      refcount), and for the instances that had ->path set by unix_bind()
      unix_release_sock() comes from unix_release(), so they are fine.
      Instances that had it set in unix_stream_connect() either end up
      attached to a socket (in unix_accept()), in which case the call
      chain to unix_release_sock() and serialization are the same as in
      the previous case, or they never get accept'ed and unix_release_sock()
      is called when the listener is shut down and its queue gets purged.
      In that case the listener's queue lock provides the barriers needed -
      unix_stream_connect() shoves our unix_sock into listener's queue
      under that lock right after having set ->path and eventual
      unix_release_sock() caller picks them from that queue under the
      same lock right before calling unix_release_sock().
          5) unix_find_other() use of ->path is pointless, but safe -
      it happens with successful lookup by (abstract) name, so ->path.dentry
      is guaranteed to be NULL there.
      earlier-variant-reviewed-by: N"Paul E. McKenney" <paulmck@linux.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      345af5ab
    • U
      net/smc: fix smc_poll in SMC_INIT state · f56b3c29
      Ursula Braun 提交于
      [ Upstream commit d7cf4a3bf3a83c977a29055e1c4ffada7697b31f ]
      
      smc_poll() returns with mask bit EPOLLPRI if the connection urg_state
      is SMC_URG_VALID. Since SMC_URG_VALID is zero, smc_poll signals
      EPOLLPRI errorneously if called in state SMC_INIT before the connection
      is created, for instance in a non-blocking connect scenario.
      
      This patch switches to non-zero values for the urg states.
      Reviewed-by: NKarsten Graul <kgraul@linux.ibm.com>
      Fixes: de8474eb ("net/smc: urgent data support")
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f56b3c29
    • P
      ipv6: route: enforce RCU protection in ip6_route_check_nh_onlink() · 2e4b2aeb
      Paolo Abeni 提交于
      [ Upstream commit bf1dc8bad1d42287164d216d8efb51c5cd381b18 ]
      
      We need a RCU critical section around rt6_info->from deference, and
      proper annotation.
      
      Fixes: 4ed591c8ab44 ("net/ipv6: Allow onlink routes to have a device mismatch if it is the default route")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e4b2aeb
    • P
      ipv6: route: enforce RCU protection in rt6_update_exception_stamp_rt() · 96dd4ef3
      Paolo Abeni 提交于
      [ Upstream commit 193f3685d0546b0cea20c99894aadb70098e47bf ]
      
      We must access rt6_info->from under RCU read lock: move the
      dereference under such lock, with proper annotation.
      
      v1 -> v2:
       - avoid using multiple, racy, fetch operations for rt->from
      
      Fixes: a68886a6 ("net/ipv6: Make from in rt6_info rcu protected")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      96dd4ef3
    • P
      ipv6: route: purge exception on removal · b9d0cb75
      Paolo Abeni 提交于
      [ Upstream commit f5b51fe804ec2a6edce0f8f6b11ea57283f5857b ]
      
      When a netdevice is unregistered, we flush the relevant exception
      via rt6_sync_down_dev() -> fib6_ifdown() -> fib6_del() -> fib6_del_route().
      
      Finally, we end-up calling rt6_remove_exception(), where we release
      the relevant dst, while we keep the references to the related fib6_info and
      dev. Such references should be released later when the dst will be
      destroyed.
      
      There are a number of caches that can keep the exception around for an
      unlimited amount of time - namely dst_cache, possibly even socket cache.
      As a result device registration may hang, as demonstrated by this script:
      
      ip netns add cl
      ip netns add rt
      ip netns add srv
      ip netns exec rt sysctl -w net.ipv6.conf.all.forwarding=1
      
      ip link add name cl_veth type veth peer name cl_rt_veth
      ip link set dev cl_veth netns cl
      ip -n cl link set dev cl_veth up
      ip -n cl addr add dev cl_veth 2001::2/64
      ip -n cl route add default via 2001::1
      
      ip -n cl link add tunv6 type ip6tnl mode ip6ip6 local 2001::2 remote 2002::1 hoplimit 64 dev cl_veth
      ip -n cl link set tunv6 up
      ip -n cl addr add 2013::2/64 dev tunv6
      
      ip link set dev cl_rt_veth netns rt
      ip -n rt link set dev cl_rt_veth up
      ip -n rt addr add dev cl_rt_veth 2001::1/64
      
      ip link add name rt_srv_veth type veth peer name srv_veth
      ip link set dev srv_veth netns srv
      ip -n srv link set dev srv_veth up
      ip -n srv addr add dev srv_veth 2002::1/64
      ip -n srv route add default via 2002::2
      
      ip -n srv link add tunv6 type ip6tnl mode ip6ip6 local 2002::1 remote 2001::2 hoplimit 64 dev srv_veth
      ip -n srv link set tunv6 up
      ip -n srv addr add 2013::1/64 dev tunv6
      
      ip link set dev rt_srv_veth netns rt
      ip -n rt link set dev rt_srv_veth up
      ip -n rt addr add dev rt_srv_veth 2002::2/64
      
      ip netns exec srv netserver & sleep 0.1
      ip netns exec cl ping6 -c 4 2013::1
      ip netns exec cl netperf -H 2013::1 -t TCP_STREAM -l 3 & sleep 1
      ip -n rt link set dev rt_srv_veth mtu 1400
      wait %2
      
      ip -n cl link del cl_veth
      
      This commit addresses the issue purging all the references held by the
      exception at time, as we currently do for e.g. ipv6 pcpu dst entries.
      
      v1 -> v2:
       - re-order the code to avoid accessing dst and net after dst_dev_put()
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9d0cb75
    • K
      net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255 · fe38cbc9
      Kalash Nainwal 提交于
      [ Upstream commit 97f0082a0592212fc15d4680f5a4d80f79a1687c ]
      
      Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255 to
      keep legacy software happy. This is similar to what was done for
      ipv4 in commit 709772e6 ("net: Fix routing tables with
      id > 255 for legacy software").
      Signed-off-by: NKalash Nainwal <kalash@arista.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe38cbc9
    • E
      net/x25: fix a race in x25_bind() · 13b43057
      Eric Dumazet 提交于
      [ Upstream commit 797a22bd5298c2674d927893f46cadf619dad11d ]
      
      syzbot was able to trigger another soft lockup [1]
      
      I first thought it was the O(N^2) issue I mentioned in my
      prior fix (f657d22ee1f "net/x25: do not hold the cpu
      too long in x25_new_lci()"), but I eventually found
      that x25_bind() was not checking SOCK_ZAPPED state under
      socket lock protection.
      
      This means that multiple threads can end up calling
      x25_insert_socket() for the same socket, and corrupt x25_list
      
      [1]
      watchdog: BUG: soft lockup - CPU#0 stuck for 123s! [syz-executor.2:10492]
      Modules linked in:
      irq event stamp: 27515
      hardirqs last  enabled at (27514): [<ffffffff81006673>] trace_hardirqs_on_thunk+0x1a/0x1c
      hardirqs last disabled at (27515): [<ffffffff8100668f>] trace_hardirqs_off_thunk+0x1a/0x1c
      softirqs last  enabled at (32): [<ffffffff8632ee73>] x25_get_neigh+0xa3/0xd0 net/x25/x25_link.c:336
      softirqs last disabled at (34): [<ffffffff86324bc3>] x25_find_socket+0x23/0x140 net/x25/af_x25.c:341
      CPU: 0 PID: 10492 Comm: syz-executor.2 Not tainted 5.0.0-rc7+ #88
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__sanitizer_cov_trace_pc+0x4/0x50 kernel/kcov.c:97
      Code: f4 ff ff ff e8 11 9f ea ff 48 c7 05 12 fb e5 08 00 00 00 00 e9 c8 e9 ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 <48> 8b 75 08 65 48 8b 04 25 40 ee 01 00 65 8b 15 38 0c 92 7e 81 e2
      RSP: 0018:ffff88806e94fc48 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: 1ffff1100d84dac5 RBX: 0000000000000001 RCX: ffffc90006197000
      RDX: 0000000000040000 RSI: ffffffff86324bf3 RDI: ffff88806c26d628
      RBP: ffff88806e94fc48 R08: ffff88806c1c6500 R09: fffffbfff1282561
      R10: fffffbfff1282560 R11: ffffffff89412b03 R12: ffff88806c26d628
      R13: ffff888090455200 R14: dffffc0000000000 R15: 0000000000000000
      FS:  00007f3a107e4700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3a107e3db8 CR3: 00000000a5544000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __x25_find_socket net/x25/af_x25.c:327 [inline]
       x25_find_socket+0x7d/0x140 net/x25/af_x25.c:342
       x25_new_lci net/x25/af_x25.c:355 [inline]
       x25_connect+0x380/0xde0 net/x25/af_x25.c:784
       __sys_connect+0x266/0x330 net/socket.c:1662
       __do_sys_connect net/socket.c:1673 [inline]
       __se_sys_connect net/socket.c:1670 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1670
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457e29
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f3a107e3c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e29
      RDX: 0000000000000012 RSI: 0000000020000200 RDI: 0000000000000005
      RBP: 000000000073c040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3a107e46d4
      R13: 00000000004be362 R14: 00000000004ceb98 R15: 00000000ffffffff
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 10493 Comm: syz-executor.3 Not tainted 5.0.0-rc7+ #88
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:193 [inline]
      RIP: 0010:queued_write_lock_slowpath+0x143/0x290 kernel/locking/qrwlock.c:86
      Code: 4c 8d 2c 01 41 83 c7 03 41 0f b6 45 00 41 38 c7 7c 08 84 c0 0f 85 0c 01 00 00 8b 03 3d 00 01 00 00 74 1a f3 90 41 0f b6 55 00 <41> 38 d7 7c eb 84 d2 74 e7 48 89 df e8 cc aa 4e 00 eb dd be 04 00
      RSP: 0018:ffff888085c47bd8 EFLAGS: 00000206
      RAX: 0000000000000300 RBX: ffffffff89412b00 RCX: 1ffffffff1282560
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff89412b00
      RBP: ffff888085c47c70 R08: 1ffffffff1282560 R09: fffffbfff1282561
      R10: fffffbfff1282560 R11: ffffffff89412b03 R12: 00000000000000ff
      R13: fffffbfff1282560 R14: 1ffff11010b88f7d R15: 0000000000000003
      FS:  00007fdd04086700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fdd04064db8 CR3: 0000000090be0000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       queued_write_lock include/asm-generic/qrwlock.h:104 [inline]
       do_raw_write_lock+0x1d6/0x290 kernel/locking/spinlock_debug.c:203
       __raw_write_lock_bh include/linux/rwlock_api_smp.h:204 [inline]
       _raw_write_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:312
       x25_insert_socket+0x21/0xe0 net/x25/af_x25.c:267
       x25_bind+0x273/0x340 net/x25/af_x25.c:703
       __sys_bind+0x23f/0x290 net/socket.c:1481
       __do_sys_bind net/socket.c:1492 [inline]
       __se_sys_bind net/socket.c:1490 [inline]
       __x64_sys_bind+0x73/0xb0 net/socket.c:1490
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457e29
      
      Fixes: 90c27297 ("X.25 remove bkl in bind")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: andrew hendry <andrew.hendry@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      13b43057
    • G
      tcp: handle inet_csk_reqsk_queue_add() failures · 173e9023
      Guillaume Nault 提交于
      [  Upstream commit 9d3e1368bb45893a75a5dfb7cd21fdebfa6b47af ]
      
      Commit 7716682c ("tcp/dccp: fix another race at listener
      dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
      {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
      weren't modified, thus leaking allocated resources on error.
      
      Contrary to tcp_check_req(), in both syncookies and TFO cases,
      we need to drop the request socket. Also, since the child socket is
      created with inet_csk_clone_lock(), we have to unlock it and drop an
      extra reference (->sk_refcount is initially set to 2 and
      inet_csk_reqsk_queue_add() drops only one ref).
      
      For TFO, we also need to revert the work done by tcp_try_fastopen()
      (with reqsk_fastopen_remove()).
      
      Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      173e9023
    • C
      tcp: Don't access TCP_SKB_CB before initializing it · fba43f49
      Christoph Paasch 提交于
      [ Upstream commit f2feaefdabb0a6253aa020f65e7388f07a9ed47c ]
      
      Since commit eeea10b8 ("tcp: add
      tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called
      after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to
      the IP-part of the cb.
      
      We thus should not mock with it, as this can trigger bugs (thanks
      syzkaller):
      [   12.349396] ==================================================================
      [   12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20
      [   12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799
      
      Setting end_seq is actually no more necessary in tcp_filter as it gets
      initialized later on in tcp_vX_fill_cb.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: eeea10b8 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fba43f49
    • S
      tcp: do not report TCP_CM_INQ of 0 for closed connections · 8accd04e
      Soheil Hassas Yeganeh 提交于
      [ Upstream commit 6466e715651f9f358e60c5ea4880e4731325827f ]
      
      Returning 0 as inq to userspace indicates there is no more data to
      read, and the application needs to wait for EPOLLIN. For a connection
      that has received FIN from the remote peer, however, the application
      must continue reading until getting EOF (return value of 0
      from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
      being used. Otherwise, the application will never receive a new
      EPOLLIN, since there is no epoll edge after the FIN.
      
      Return 1 when there is no data left on the queue but the
      connection has received FIN, so that the applications continue
      reading.
      
      Fixes: b75eba76 (tcp: send in-queue bytes in cmsg upon read)
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8accd04e