1. 19 10月, 2018 4 次提交
    • S
      ip6_tunnel: Fix encapsulation layout · d4d576f5
      Stefano Brivio 提交于
      Commit 058214a4 ("ip6_tun: Add infrastructure for doing
      encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
      the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
      Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.
      
      As long as the option didn't actually end up in generated packets, this
      wasn't an issue. Then commit 89a23c8b ("ip6_tunnel: Fix missing tunnel
      encapsulation limit option") fixed sending of this option, and the
      resulting layout, e.g. for FoU, is:
      
      .-------------------.------------.----------.-------------------.----- - -
      | Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
      '-------------------'------------'----------'-------------------'----- - -
      
      Needless to say, FoU and GUE (at least) won't work over IPv6. The option
      is appended by default, and I couldn't find a way to disable it with the
      current iproute2.
      
      Turn this into a more reasonable:
      
      .-------------------.----------.------------.-------------------.----- - -
      | Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
      '-------------------'----------'------------'-------------------'----- - -
      
      With this, and with 84dad559 ("udp6: fix encap return code for
      resubmitting"), FoU and GUE work again over IPv6.
      
      Fixes: 058214a4 ("ip6_tun: Add infrastructure for doing encapsulation")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4d576f5
    • J
      tipc: fix info leak from kernel tipc_event · b06f9d9f
      Jon Maloy 提交于
      We initialize a struct tipc_event allocated on the kernel stack to
      zero to avert info leak to user space.
      
      Reported-by: syzbot+057458894bc8cada4dee@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b06f9d9f
    • W
      net: socket: fix a missing-check bug · b6168562
      Wenwen Wang 提交于
      In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
      statement to see whether it is necessary to pre-process the ethtool
      structure, because, as mentioned in the comment, the structure
      ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
      is allocated through compat_alloc_user_space(). One thing to note here is
      that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
      partially determined by 'rule_cnt', which is actually acquired from the
      user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
      get_user(). After 'rxnfc' is allocated, the data in the original user-space
      buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
      including the 'rule_cnt' field. However, after this copy, no check is
      re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
      race to change the value in the 'compat_rxnfc->rule_cnt' between these two
      copies. Through this way, the attacker can bypass the previous check on
      'rule_cnt' and inject malicious data. This can cause undefined behavior of
      the kernel and introduce potential security risk.
      
      This patch avoids the above issue via copying the value acquired by
      get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6168562
    • P
      net: sched: Fix for duplicate class dump · 3c53ed8f
      Phil Sutter 提交于
      When dumping classes by parent, kernel would return classes twice:
      
      | # tc qdisc add dev lo root prio
      | # tc class show dev lo
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | # tc class show dev lo parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      
      This comes from qdisc_match_from_root() potentially returning the root
      qdisc itself if its handle matched. Though in that case, root's classes
      were already dumped a few lines above.
      
      Fixes: cb395b20 ("net: sched: optimize class dumps")
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c53ed8f
  2. 18 10月, 2018 5 次提交
    • N
      net: ipmr: fix unresolved entry dumps · eddf016b
      Nikolay Aleksandrov 提交于
      If the skb space ends in an unresolved entry while dumping we'll miss
      some unresolved entries. The reason is due to zeroing the entry counter
      between dumping resolved and unresolved mfc entries. We should just
      keep counting until the whole table is dumped and zero when we move to
      the next as we have a separate table counter.
      Reported-by: NColin Ian King <colin.king@canonical.com>
      Fixes: 8fb472c0 ("ipmr: improve hash scalability")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eddf016b
    • P
      udp6: fix encap return code for resubmitting · 84dad559
      Paolo Abeni 提交于
      The commit eb63f296 ("udp6: add missing checks on edumux packet
      processing") used the same return code convention of the ipv4 counterpart,
      but ipv6 uses the opposite one: positive values means resubmit.
      
      This change addresses the issue, using positive return value for
      resubmitting. Also update the related comment, which was broken, too.
      
      Fixes: eb63f296 ("udp6: add missing checks on edumux packet processing")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84dad559
    • X
      sctp: not free the new asoc when sctp_wait_for_connect returns err · c863850c
      Xin Long 提交于
      When sctp_wait_for_connect is called to wait for connect ready
      for sp->strm_interleave in sctp_sendmsg_to_asoc, a panic could
      be triggered if cpu is scheduled out and the new asoc is freed
      elsewhere, as it will return err and later the asoc gets freed
      again in sctp_sendmsg.
      
      [  285.840764] list_del corruption, ffff9f0f7b284078->next is LIST_POISON1 (dead000000000100)
      [  285.843590] WARNING: CPU: 1 PID: 8861 at lib/list_debug.c:47 __list_del_entry_valid+0x50/0xa0
      [  285.846193] Kernel panic - not syncing: panic_on_warn set ...
      [  285.846193]
      [  285.848206] CPU: 1 PID: 8861 Comm: sctp_ndata Kdump: loaded Not tainted 4.19.0-rc7.label #584
      [  285.850559] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  285.852164] Call Trace:
      ...
      [  285.872210]  ? __list_del_entry_valid+0x50/0xa0
      [  285.872894]  sctp_association_free+0x42/0x2d0 [sctp]
      [  285.873612]  sctp_sendmsg+0x5a4/0x6b0 [sctp]
      [  285.874236]  sock_sendmsg+0x30/0x40
      [  285.874741]  ___sys_sendmsg+0x27a/0x290
      [  285.875304]  ? __switch_to_asm+0x34/0x70
      [  285.875872]  ? __switch_to_asm+0x40/0x70
      [  285.876438]  ? ptep_set_access_flags+0x2a/0x30
      [  285.877083]  ? do_wp_page+0x151/0x540
      [  285.877614]  __sys_sendmsg+0x58/0xa0
      [  285.878138]  do_syscall_64+0x55/0x180
      [  285.878669]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a similar issue with the one fixed in Commit ca3af4dd
      ("sctp: do not free asoc when it is already dead in sctp_sendmsg").
      But this one can't be fixed by returning -ESRCH for the dead asoc
      in sctp_wait_for_connect, as it will break sctp_connect's return
      value to users.
      
      This patch is to simply set err to -ESRCH before it returns to
      sctp_sendmsg when any err is returned by sctp_wait_for_connect
      for sp->strm_interleave, so that no asoc would be freed due to
      this.
      
      When users see this error, they will know the packet hasn't been
      sent. And it also makes sense to not free asoc because waiting
      connect fails, like the second call for sctp_wait_for_connect in
      sctp_sendmsg_to_asoc.
      
      Fixes: 668c9beb ("sctp: implement assign_number for sctp_stream_interleave")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c863850c
    • M
      sctp: fix race on sctp_id2asoc · b336deca
      Marcelo Ricardo Leitner 提交于
      syzbot reported an use-after-free involving sctp_id2asoc.  Dmitry Vyukov
      helped to root cause it and it is because of reading the asoc after it
      was freed:
      
              CPU 1                       CPU 2
      (working on socket 1)            (working on socket 2)
      	                         sctp_association_destroy
      sctp_id2asoc
         spin lock
           grab the asoc from idr
         spin unlock
                                         spin lock
      				     remove asoc from idr
      				   spin unlock
      				   free(asoc)
         if asoc->base.sk != sk ... [*]
      
      This can only be hit if trying to fetch asocs from different sockets. As
      we have a single IDR for all asocs, in all SCTP sockets, their id is
      unique on the system. An application can try to send stuff on an id
      that matches on another socket, and the if in [*] will protect from such
      usage. But it didn't consider that as that asoc may belong to another
      socket, it may be freed in parallel (read: under another socket lock).
      
      We fix it by moving the checks in [*] into the protected region. This
      fixes it because the asoc cannot be freed while the lock is held.
      
      Reported-by: syzbot+c7dd55d7aec49d48e49a@syzkaller.appspotmail.com
      Acked-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b336deca
    • T
      net: bpfilter: use get_pid_task instead of pid_task · 84258438
      Taehee Yoo 提交于
      pid_task() dereferences rcu protected tasks array.
      But there is no rcu_read_lock() in shutdown_umh() routine so that
      rcu_read_lock() is needed.
      get_pid_task() is wrapper function of pid_task. it holds rcu_read_lock()
      then calls pid_task(). if task isn't NULL, it increases reference count
      of task.
      
      test commands:
         %modprobe bpfilter
         %modprobe -rv bpfilter
      
      splat looks like:
      [15102.030932] =============================
      [15102.030957] WARNING: suspicious RCU usage
      [15102.030985] 4.19.0-rc7+ #21 Not tainted
      [15102.031010] -----------------------------
      [15102.031038] kernel/pid.c:330 suspicious rcu_dereference_check() usage!
      [15102.031063]
      	       other info that might help us debug this:
      
      [15102.031332]
      	       rcu_scheduler_active = 2, debug_locks = 1
      [15102.031363] 1 lock held by modprobe/1570:
      [15102.031389]  #0: 00000000580ef2b0 (bpfilter_lock){+.+.}, at: stop_umh+0x13/0x52 [bpfilter]
      [15102.031552]
                     stack backtrace:
      [15102.031583] CPU: 1 PID: 1570 Comm: modprobe Not tainted 4.19.0-rc7+ #21
      [15102.031607] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [15102.031628] Call Trace:
      [15102.031676]  dump_stack+0xc9/0x16b
      [15102.031723]  ? show_regs_print_info+0x5/0x5
      [15102.031801]  ? lockdep_rcu_suspicious+0x117/0x160
      [15102.031855]  pid_task+0x134/0x160
      [15102.031900]  ? find_vpid+0xf0/0xf0
      [15102.032017]  shutdown_umh.constprop.1+0x1e/0x53 [bpfilter]
      [15102.032055]  stop_umh+0x46/0x52 [bpfilter]
      [15102.032092]  __x64_sys_delete_module+0x47e/0x570
      [ ... ]
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84258438
  3. 17 10月, 2018 1 次提交
  4. 16 10月, 2018 13 次提交
    • D
      rxrpc: Fix a missing rxrpc_put_peer() in the error_report handler · 1890fea7
      David Howells 提交于
      Fix a missing call to rxrpc_put_peer() on the main path through the
      rxrpc_error_report() function.  This manifests itself as a ref leak
      whenever an ICMP packet or other error comes in.
      
      In commit f3344303, the hand-off of the ref to a work item was removed
      and was not replaced with a put.
      
      Fixes: f3344303 ("rxrpc: Fix error distribution")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1890fea7
    • X
      sctp: use the pmtu from the icmp packet to update transport pathmtu · d805397c
      Xin Long 提交于
      Other than asoc pmtu sync from all transports, sctp_assoc_sync_pmtu
      is also processing transport pmtu_pending by icmp packets. But it's
      meaningless to use sctp_dst_mtu(t->dst) as new pmtu for a transport.
      
      The right pmtu value should come from the icmp packet, and it would
      be saved into transport->mtu_info in this patch and used later when
      the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config.
      
      Besides, without this patch, as pmtu can only be updated correctly
      when receiving a icmp packet and no place is holding sock lock, it
      will take long time if the sock is busy with sending packets.
      
      Note that it doesn't process transport->mtu_info in .release_cb(),
      as there is no enough information for pmtu update, like for which
      asoc or transport. It is not worth traversing all asocs to check
      pmtu_pending. So unlike tcp, sctp does this in tx path, for which
      mtu_info needs to be atomic_t.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d805397c
    • E
      ipv6: mcast: fix a use-after-free in inet6_mc_check · dc012f36
      Eric Dumazet 提交于
      syzbot found a use-after-free in inet6_mc_check [1]
      
      The problem here is that inet6_mc_check() uses rcu
      and read_lock(&iml->sflock)
      
      So the fact that ip6_mc_leave_src() is called under RTNL
      and the socket lock does not help us, we need to acquire
      iml->sflock in write mode.
      
      In the future, we should convert all this stuff to RCU.
      
      [1]
      BUG: KASAN: use-after-free in ipv6_addr_equal include/net/ipv6.h:521 [inline]
      BUG: KASAN: use-after-free in inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
      Read of size 8 at addr ffff8801ce7f2510 by task syz-executor0/22432
      
      CPU: 1 PID: 22432 Comm: syz-executor0 Not tainted 4.19.0-rc7+ #280
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
       ipv6_addr_equal include/net/ipv6.h:521 [inline]
       inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
       __raw_v6_lookup+0x320/0x3f0 net/ipv6/raw.c:98
       ipv6_raw_deliver net/ipv6/raw.c:183 [inline]
       raw6_local_deliver+0x3d3/0xcb0 net/ipv6/raw.c:240
       ip6_input_finish+0x467/0x1aa0 net/ipv6/ip6_input.c:345
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:426
       ip6_mc_input+0x48a/0xd20 net/ipv6/ip6_input.c:503
       dst_input include/net/dst.h:450 [inline]
       ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ipv6_rcv+0x120/0x640 net/ipv6/ip6_input.c:271
       __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
       netif_receive_skb_internal+0x12c/0x620 net/core/dev.c:5126
       napi_frags_finish net/core/dev.c:5664 [inline]
       napi_gro_frags+0x75a/0xc90 net/core/dev.c:5737
       tun_get_user+0x3189/0x4250 drivers/net/tun.c:1923
       tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1968
       call_write_iter include/linux/fs.h:1808 [inline]
       do_iter_readv_writev+0x8b0/0xa80 fs/read_write.c:680
       do_iter_write+0x185/0x5f0 fs/read_write.c:959
       vfs_writev+0x1f1/0x360 fs/read_write.c:1004
       do_writev+0x11a/0x310 fs/read_write.c:1039
       __do_sys_writev fs/read_write.c:1112 [inline]
       __se_sys_writev fs/read_write.c:1109 [inline]
       __x64_sys_writev+0x75/0xb0 fs/read_write.c:1109
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457421
      Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b5 fb ff c3 48 83 ec 08 e8 1a 2d 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 63 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
      RSP: 002b:00007f2d30ecaba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
      RAX: ffffffffffffffda RBX: 000000000000003e RCX: 0000000000457421
      RDX: 0000000000000001 RSI: 00007f2d30ecabf0 RDI: 00000000000000f0
      RBP: 0000000020000500 R08: 00000000000000f0 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 00007f2d30ecb6d4
      R13: 00000000004c4890 R14: 00000000004d7b90 R15: 00000000ffffffff
      
      Allocated by task 22437:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
       __do_kmalloc mm/slab.c:3718 [inline]
       __kmalloc+0x14e/0x760 mm/slab.c:3727
       kmalloc include/linux/slab.h:518 [inline]
       sock_kmalloc+0x15a/0x1f0 net/core/sock.c:1983
       ip6_mc_source+0x14dd/0x1960 net/ipv6/mcast.c:427
       do_ipv6_setsockopt.isra.9+0x3afb/0x45d0 net/ipv6/ipv6_sockglue.c:743
       ipv6_setsockopt+0xbd/0x170 net/ipv6/ipv6_sockglue.c:933
       rawv6_setsockopt+0x59/0x140 net/ipv6/raw.c:1069
       sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3038
       __sys_setsockopt+0x1ba/0x3c0 net/socket.c:1902
       __do_sys_setsockopt net/socket.c:1913 [inline]
       __se_sys_setsockopt net/socket.c:1910 [inline]
       __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 22430:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kfree+0xcf/0x230 mm/slab.c:3813
       __sock_kfree_s net/core/sock.c:2004 [inline]
       sock_kfree_s+0x29/0x60 net/core/sock.c:2010
       ip6_mc_leave_src+0x11a/0x1d0 net/ipv6/mcast.c:2448
       __ipv6_sock_mc_close+0x20b/0x4e0 net/ipv6/mcast.c:310
       ipv6_sock_mc_close+0x158/0x1d0 net/ipv6/mcast.c:328
       inet6_release+0x40/0x70 net/ipv6/af_inet6.c:452
       __sock_release+0xd7/0x250 net/socket.c:579
       sock_close+0x19/0x20 net/socket.c:1141
       __fput+0x385/0xa30 fs/file_table.c:278
       ____fput+0x15/0x20 fs/file_table.c:309
       task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:193 [inline]
       exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
       prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
       do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8801ce7f2500
       which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 16 bytes inside of
       192-byte region [ffff8801ce7f2500, ffff8801ce7f25c0)
      The buggy address belongs to the page:
      page:ffffea000739fc80 count:1 mapcount:0 mapping:ffff8801da800040 index:0x0
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0006f6e548 ffffea000737b948 ffff8801da800040
      raw: 0000000000000000 ffff8801ce7f2000 0000000100000010 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801ce7f2400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8801ce7f2480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      >ffff8801ce7f2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                               ^
       ffff8801ce7f2580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff8801ce7f2600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc012f36
    • T
      tipc: fix unsafe rcu locking when accessing publication list · d3092b2e
      Tung Nguyen 提交于
      The binding table's 'cluster_scope' list is rcu protected to handle
      races between threads changing the list and those traversing the list at
      the same moment. We have now found that the function named_distribute()
      uses the regular list_for_each() macro to traverse the said list.
      Likewise, the function tipc_named_withdraw() is removing items from the
      same list using the regular list_del() call. When these two functions
      execute in parallel we see occasional crashes.
      
      This commit fixes this by adding the missing _rcu() suffixes.
      Signed-off-by: NTung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3092b2e
    • D
      rxrpc: Fix incorrect conditional on IPV6 · 7ec8dc96
      David Howells 提交于
      The udpv6_encap_enable() function is part of the ipv6 code, and if that is
      configured as a loadable module and rxrpc is built in then a build failure
      will occur because the conditional check is wrong:
      
        net/rxrpc/local_object.o: In function `rxrpc_lookup_local':
        local_object.c:(.text+0x2688): undefined reference to `udpv6_encap_enable'
      
      Use the correct config symbol (CONFIG_AF_RXRPC_IPV6) in the conditional
      check rather than CONFIG_IPV6 as that will do the right thing.
      
      Fixes: 5271953c ("rxrpc: Use the UDP encap_rcv hook")
      Reported-by: kbuild-all@01.org
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ec8dc96
    • S
      ipv6: rate-limit probes for neighbourless routes · f547fac6
      Sabrina Dubroca 提交于
      When commit 27097255 ("[IPV6]: ROUTE: Add Router Reachability
      Probing (RFC4191).") introduced router probing, the rt6_probe() function
      required that a neighbour entry existed. This neighbour entry is used to
      record the timestamp of the last probe via the ->updated field.
      
      Later, commit 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
      removed the requirement for a neighbour entry. Neighbourless routes skip
      the interval check and are not rate-limited.
      
      This patch adds rate-limiting for neighbourless routes, by recording the
      timestamp of the last probe in the fib6_info itself.
      
      Fixes: 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f547fac6
    • Y
      rxrpc: use correct kvec num when sending BUSY response packet · d6672a5a
      YueHaibing 提交于
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      net/rxrpc/output.c: In function 'rxrpc_reject_packets':
      net/rxrpc/output.c:527:11: warning:
       variable 'ioc' set but not used [-Wunused-but-set-variable]
      
      'ioc' is the correct kvec num when sending a BUSY (or an ABORT) response
      packet.
      
      Fixes: ece64fec ("rxrpc: Emit BUSY packets when supposed to rather than ABORTs")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6672a5a
    • D
      rxrpc: Fix an uninitialised variable · d7b4c24f
      David Howells 提交于
      Fix an uninitialised variable introduced by the last patch.  This can cause
      a crash when a new call comes in to a local service, such as when an AFS
      fileserver calls back to the local cache manager.
      
      Fixes: c1e15b49 ("rxrpc: Fix the packet reception routine")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7b4c24f
    • J
      tipc: initialize broadcast link stale counter correctly · 4af00f4c
      Jon Maloy 提交于
      In the commit referred to below we added link tolerance as an additional
      criteria for declaring broadcast transmission "stale" and resetting the
      unicast links to the affected node.
      
      Unfortunately, this 'improvement' introduced two bugs, which each and
      one alone cause only limited problems, but combined lead to seemingly
      stochastic unicast link resets, depending on the amount of broadcast
      traffic transmitted.
      
      The first issue, a missing initialization of the 'tolerance' field of
      the receiver broadcast link, was recently fixed by commit 047491ea
      ("tipc: set link tolerance correctly in broadcast link").
      
      Ths second issue, where we omit to reset the 'stale_cnt' field of
      the same link after a 'stale' period is over, leads to this counter
      accumulating over time, and in the absence of the 'tolerance' criteria
      leads to the above described symptoms. This commit adds the missing
      initialization.
      
      Fixes: a4dc70d4 ("tipc: extend link reset criteria for stale packet retransmission")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4af00f4c
    • C
      llc: set SOCK_RCU_FREE in llc_sap_add_socket() · 5a8e7aea
      Cong Wang 提交于
      WHen an llc sock is added into the sk_laddr_hash of an llc_sap,
      it is not marked with SOCK_RCU_FREE.
      
      This causes that the sock could be freed while it is still being
      read by __llc_lookup_established() with RCU read lock. sock is
      refcounted, but with RCU read lock, nothing prevents the readers
      getting a zero refcnt.
      
      Fix it by setting SOCK_RCU_FREE in llc_sap_add_socket().
      
      Reported-by: syzbot+11e05f04c15e03be5254@syzkaller.appspotmail.com
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a8e7aea
    • D
      net/sched: cls_api: add missing validation of netlink attributes · e331473f
      Davide Caratti 提交于
      Similarly to what has been done in 8b4c3cdd ("net: sched: Add policy
      validation for tc attributes"), fix classifier code to add validation of
      TCA_CHAIN and TCA_KIND netlink attributes.
      
      tested with:
       # ./tdc.py -c filter
      
      v2: Let sch_api and cls_api share nla_policy they have in common, thanks
          to David Ahern.
      v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
          by TC modules, thanks to Cong Wang.
          While at it, restore the 'Delete / get qdisc' comment to its orginal
          position, just above tc_get_qdisc() function prototype.
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e331473f
    • W
      ethtool: fix a privilege escalation bug · 58f5bbe3
      Wenwen Wang 提交于
      In dev_ethtool(), the eth command 'ethcmd' is firstly copied from the
      use-space buffer 'useraddr' and checked to see whether it is
      ETHTOOL_PERQUEUE. If yes, the sub-command 'sub_cmd' is further copied from
      the user space. Otherwise, 'sub_cmd' is the same as 'ethcmd'. Next,
      according to 'sub_cmd', a permission check is enforced through the function
      ns_capable(). For example, the permission check is required if 'sub_cmd' is
      ETHTOOL_SCOALESCE, but it is not necessary if 'sub_cmd' is
      ETHTOOL_GCOALESCE, as suggested in the comment "Allow some commands to be
      done by anyone". The following execution invokes different handlers
      according to 'ethcmd'. Specifically, if 'ethcmd' is ETHTOOL_PERQUEUE,
      ethtool_set_per_queue() is called. In ethtool_set_per_queue(), the kernel
      object 'per_queue_opt' is copied again from the user-space buffer
      'useraddr' and 'per_queue_opt.sub_command' is used to determine which
      operation should be performed. Given that the buffer 'useraddr' is in the
      user space, a malicious user can race to change the sub-command between the
      two copies. In particular, the attacker can supply ETHTOOL_PERQUEUE and
      ETHTOOL_GCOALESCE to bypass the permission check in dev_ethtool(). Then
      before ethtool_set_per_queue() is called, the attacker changes
      ETHTOOL_GCOALESCE to ETHTOOL_SCOALESCE. In this way, the attacker can
      bypass the permission check and execute ETHTOOL_SCOALESCE.
      
      This patch enforces a check in ethtool_set_per_queue() after the second
      copy from 'useraddr'. If the sub-command is different from the one obtained
      in the first copy in dev_ethtool(), an error code EINVAL will be returned.
      
      Fixes: f38d138a ("net/ethtool: support set coalesce per queue")
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Reviewed-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58f5bbe3
    • W
      ethtool: fix a missing-check bug · 2bb3207d
      Wenwen Wang 提交于
      In ethtool_get_rxnfc(), the eth command 'cmd' is compared against
      'ETHTOOL_GRXFH' to see whether it is necessary to adjust the variable
      'info_size'. Then the whole structure of 'info' is copied from the
      user-space buffer 'useraddr' with 'info_size' bytes. In the following
      execution, 'info' may be copied again from the buffer 'useraddr' depending
      on the 'cmd' and the 'info.flow_type'. However, after these two copies,
      there is no check between 'cmd' and 'info.cmd'. In fact, 'cmd' is also
      copied from the buffer 'useraddr' in dev_ethtool(), which is the caller
      function of ethtool_get_rxnfc(). Given that 'useraddr' is in the user
      space, a malicious user can race to change the eth command in the buffer
      between these copies. By doing so, the attacker can supply inconsistent
      data and cause undefined behavior because in the following execution 'info'
      will be passed to ops->get_rxnfc().
      
      This patch adds a necessary check on 'info.cmd' and 'cmd' to confirm that
      they are still same after the two copies in ethtool_get_rxnfc(). Otherwise,
      an error code EINVAL will be returned.
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bb3207d
  5. 12 10月, 2018 1 次提交
    • Y
      tipc: eliminate possible recursive locking detected by LOCKDEP · a1f8dd34
      Ying Xue 提交于
      When booting kernel with LOCKDEP option, below warning info was found:
      
      WARNING: possible recursive locking detected
      4.19.0-rc7+ #14 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
      
      but task is already holding lock:
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&list->lock)->rlock#4);
        lock(&(&list->lock)->rlock#4);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      2 locks held by swapper/0/1:
       #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at:
      register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      spin_lock_bh include/linux/spinlock.h:334 [inline]
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      stack backtrace:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1af/0x295 lib/dump_stack.c:113
       print_deadlock_bug kernel/locking/lockdep.c:1759 [inline]
       check_deadlock kernel/locking/lockdep.c:1803 [inline]
       validate_chain kernel/locking/lockdep.c:2399 [inline]
       __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
       lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
       __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
       _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
       spin_lock_bh include/linux/spinlock.h:334 [inline]
       tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
       tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
       tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
       tipc_init_net+0x472/0x610 net/tipc/core.c:82
       ops_init+0xf7/0x520 net/core/net_namespace.c:129
       __register_pernet_operations net/core/net_namespace.c:940 [inline]
       register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
       register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
       tipc_init+0x83/0x104 net/tipc/core.c:140
       do_one_initcall+0x109/0x70a init/main.c:885
       do_initcall_level init/main.c:953 [inline]
       do_initcalls init/main.c:961 [inline]
       do_basic_setup init/main.c:979 [inline]
       kernel_init_freeable+0x4bd/0x57f init/main.c:1144
       kernel_init+0x13/0x180 init/main.c:1063
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
      
      The reason why the noise above was complained by LOCKDEP is because we
      nested to hold l->wakeupq.lock and l->inputq->lock in tipc_link_reset
      function. In fact it's unnecessary to move skb buffer from l->wakeupq
      queue to l->inputq queue while holding the two locks at the same time.
      Instead, we can move skb buffers in l->wakeupq queue to a temporary
      list first and then move the buffers of the temporary list to l->inputq
      queue, which is also safe for us.
      
      Fixes: 3f32d0be ("tipc: lock wakeup & inputq at tipc_link_reset()")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1f8dd34
  6. 11 10月, 2018 13 次提交
    • F
      xfrm: policy: use hlist rcu variants on insert · 9dffff20
      Florian Westphal 提交于
      bydst table/list lookups use rcu, so insertions must use rcu versions.
      
      Fixes: a7c44247 ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      9dffff20
    • A
      net/xfrm: fix out-of-bounds packet access · 9f7e43da
      Alexei Starovoitov 提交于
      BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
      net/ipv6/xfrm6_policy.c:161
      Read of size 1 at addr ffff8801d882eec7 by task syz-executor1/6667
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        print_address_description+0x6c/0x20b mm/kasan/report.c:256
        kasan_report_error mm/kasan/report.c:354 [inline]
        kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
        __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
        _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
        __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
        xfrm_decode_session include/net/xfrm.h:1232 [inline]
        vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
        __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
        netdev_start_xmit include/linux/netdevice.h:4322 [inline]
        xmit_one net/core/dev.c:3217 [inline]
        dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
        __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
        dev_queue_xmit+0x17/0x20 net/core/dev.c:3836
      
      Reported-by: syzbot+acffccec848dc13fe459@syzkaller.appspotmail.com
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      9f7e43da
    • B
      xsk: do not call synchronize_net() under RCU read lock · cee27167
      Björn Töpel 提交于
      The XSKMAP update and delete functions called synchronize_net(), which
      can sleep. It is not allowed to sleep during an RCU read section.
      
      Instead we need to make sure that the sock sk_destruct (xsk_destruct)
      function is asynchronously called after an RCU grace period. Setting
      the SOCK_RCU_FREE flag for XDP sockets takes care of this.
      
      Fixes: fbfc504a ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cee27167
    • P
      tipc: queue socket protocol error messages into socket receive buffer · e7eb0582
      Parthasarathy Bhuvaragan 提交于
      In tipc_sk_filter_rcv(), when we detect protocol messages with error we
      call tipc_sk_conn_proto_rcv() and let it reset the connection and notify
      the socket by calling sk->sk_state_change().
      
      However, tipc_sk_filter_rcv() may have been called from the function
      tipc_backlog_rcv(), in which case the socket lock is held and the socket
      already awake. This means that the sk_state_change() call is ignored and
      the error notification lost. Now the receive queue will remain empty and
      the socket sleeps forever.
      
      In this commit, we convert the protocol message into a connection abort
      message and enqueue it into the socket's receive queue. By this addition
      to the above state change we cover all conditions.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7eb0582
    • J
      tipc: set link tolerance correctly in broadcast link · 047491ea
      Jon Maloy 提交于
      In the patch referred to below we added link tolerance as an additional
      criteria for declaring broadcast transmission "stale" and resetting the
      affected links.
      
      However, the 'tolerance' field of the broadcast link is never set, and
      remains at zero. This renders the whole commit without the intended
      improving effect, but luckily also with no negative effect.
      
      In this commit we add the missing initialization.
      
      Fixes: a4dc70d4 ("tipc: extend link reset criteria for stale packet retransmission")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      047491ea
    • S
      net: ipv4: don't let PMTU updates increase route MTU · 28d35bcd
      Sabrina Dubroca 提交于
      When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
      received, we must clamp its value. However, we can receive a PMTU
      exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
      increase in PMTU.
      
      To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.
      
      Before this patch, in case of an update, the exception's MTU would
      always change. Now, an exception can have only its lock flag updated,
      but not the MTU, so we need to add a check on locking to the following
      "is this exception getting updated, or close to expiring?" test.
      
      Fixes: d52e5a7e ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d35bcd
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
    • M
      net/ipv6: stop leaking percpu memory in fib6 info · 7abab7b9
      Mike Rapoport 提交于
      The fib6_info_alloc() function allocates percpu memory to hold per CPU
      pointers to rt6_info, but this memory is never freed. Fix it.
      
      Fixes: a64efe14 ("net/ipv6: introduce fib6_info struct and helpers")
      Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7abab7b9
    • K
      rds: RDS (tcp) hangs on sendto() to unresponding address · 9a4890bd
      Ka-Cheong Poon 提交于
      In rds_send_mprds_hash(), if the calculated hash value is non-zero and
      the MPRDS connections are not yet up, it will wait.  But it should not
      wait if the send is non-blocking.  In this case, it should just use the
      base c_path for sending the message.
      Signed-off-by: NKa-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a4890bd
    • E
      net: make skb_partial_csum_set() more robust against overflows · 52b5d6f5
      Eric Dumazet 提交于
      syzbot managed to crash in skb_checksum_help() [1] :
      
              BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
      
      Root cause is the following check in skb_partial_csum_set()
      
      	if (unlikely(start > skb_headlen(skb)) ||
      	    unlikely((int)start + off > skb_headlen(skb) - 2))
      		return false;
      
      If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
      and the check fails to detect that ((int)start + off) is off the limit,
      since the compare is unsigned.
      
      When we fix that, then the first condition (start > skb_headlen(skb))
      becomes obsolete.
      
      Then we should also check that (skb_headroom(skb) + start) wont
      overflow 16bit field.
      
      [1]
      kernel BUG at net/core/dev.c:2880!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880
      Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf
      RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293
      RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7
      RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006
      RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000
      R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001
      R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8
      FS:  00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269
       validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312
       __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
       packet_snd net/packet/af_packet.c:2928 [inline]
       packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953
      
      Fixes: 5ff8dda3 ("net: Ensure partial checksum offset is inside the skb head")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52b5d6f5
    • M
      devlink: Add helper function for safely copy string param · bde74ad1
      Moshe Shemesh 提交于
      Devlink string param buffer is allocated at the size of
      DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
      this size is not exceeded.
      Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
      __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
      devlink only. The driver should use the helper function instead to
      verify it doesn't exceed the allowed length.
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bde74ad1
    • M
      devlink: Fix param cmode driverinit for string type · 1276534c
      Moshe Shemesh 提交于
      Driverinit configuration mode value is held by devlink to enable the
      driver fetch the value after reload command. In case the param type is
      string devlink should copy the value from driver string buffer to
      devlink string buffer on devlink_param_driverinit_value_set() and
      vice-versa on devlink_param_driverinit_value_get().
      
      Fixes: ec01aeb1 ("devlink: Add support for get/set driverinit value")
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1276534c
    • M
      devlink: Fix param set handling for string type · f355cfcd
      Moshe Shemesh 提交于
      In case devlink param type is string, it needs to copy the string value
      it got from the input to devlink_param_value.
      
      Fixes: e3b7ca18 ("devlink: Add param set command")
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f355cfcd
  7. 09 10月, 2018 3 次提交
    • D
      rxrpc: Fix the packet reception routine · c1e15b49
      David Howells 提交于
      The rxrpc_input_packet() function and its call tree was built around the
      assumption that data_ready() handler called from UDP to inform a kernel
      service that there is data to be had was non-reentrant.  This means that
      certain locking could be dispensed with.
      
      This, however, turns out not to be the case with a multi-queue network card
      that can deliver packets to multiple cpus simultaneously.  Each of those
      cpus can be in the rxrpc_input_packet() function at the same time.
      
      Fix by adding or changing some structure members:
      
       (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
      
       (2) Make conn->service_id into a 32-bit variable so that it can be
           cmpxchg'd on all arches.
      
       (3) Add call->input_lock to serialise access to the Rx/Tx state.  Note
           that although the Rx and Tx states are (almost) entirely separate,
           there's no point completing the separation and having separate locks
           since it's a bi-phasal RPC protocol rather than a bi-direction
           streaming protocol.  Data transmission and data reception do not take
           place simultaneously on any particular call.
      
      and making the following functional changes:
      
       (1) In rxrpc_input_data(), hold call->input_lock around the core to
           prevent simultaneous producing of packets into the Rx ring and
           updating of tracking state for a particular call.
      
       (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
           check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
           The bit test and bit clear can then be combined.  No further locking
           is needed here.
      
       (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
           the ACK packet.  The superseded ACK check is then done both before and
           after the lock is taken.
      
           The handing of ackinfo data is split, parsing before the lock is taken
           and processing with it held.  This is keyed on rxMTU being non-zero.
      
           Congestion management is also done within the locked section.
      
       (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
           rotation.  The ACKALL packet carries no information and is only really
           useful after all packets have been transmitted since it's imprecise.
      
       (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
           prevent calls being simultaneously implicitly ended on two cpus and
           also to prevent any races with incoming call setup.
      
       (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
           on a connection.  It is only permitted to happen once for a
           connection.
      
       (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
           rx->incoming_lock to see if someone else set up the call, connection
           or peer whilst we were getting there.  We can't trust the values from
           the earlier routing check unless we pin refs on them - which we want
           to avoid.
      
           Further, we need to allow for an incoming call to have its state
           changed on another CPU between us making it live and us adjusting it
           because the conn is now in the RXRPC_CONN_SERVICE state.
      
       (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
           to the RTT buffer.  Don't need to lock around setting peer->rtt.
      
      For reference, the inventory of state-accessing or state-altering functions
      used by the packet input procedure is:
      
      > rxrpc_input_packet()
        * PACKET CHECKING
      
        * ROUTING
          > rxrpc_post_packet_to_local()
          > rxrpc_find_connection_rcu() - uses RCU
            > rxrpc_lookup_peer_rcu() - uses RCU
            > rxrpc_find_service_conn_rcu() - uses RCU
            > idr_find() - uses RCU
      
        * CONNECTION-LEVEL PROCESSING
          - Service upgrade
            - Can only happen once per conn
            ! Changed to use cmpxchg
          > rxrpc_post_packet_to_conn()
          - Setting conn->hi_serial
            - Probably safe not using locks
            - Maybe use cmpxchg
      
        * CALL-LEVEL PROCESSING
          > Old-call checking
            > rxrpc_input_implicit_end_call()
              > rxrpc_call_completed()
      	> rxrpc_queue_call()
      	! Need to take rx->incoming_lock
      	> __rxrpc_disconnect_call()
      	> rxrpc_notify_socket()
          > rxrpc_new_incoming_call()
            - Uses rx->incoming_lock for the entire process
              - Might be able to drop this earlier in favour of the call lock
            > rxrpc_incoming_call()
            	! Conflicts with rxrpc_input_implicit_end_call()
          > rxrpc_send_ping()
            - Don't need locks to check rtt state
            > rxrpc_propose_ACK
      
        * PACKET DISTRIBUTION
          > rxrpc_input_call_packet()
            > rxrpc_input_data()
      	* QUEUE DATA PACKET ON CALL
      	> rxrpc_reduce_call_timer()
      	  - Uses timer_reduce()
      	! Needs call->input_lock()
      	> rxrpc_receiving_reply()
      	  ! Needs locking around ack state
      	  > rxrpc_rotate_tx_window()
      	  > rxrpc_end_tx_phase()
      	> rxrpc_proto_abort()
      	> rxrpc_input_dup_data()
      	- Fills the Rx buffer
      	- rxrpc_propose_ACK()
      	- rxrpc_notify_socket()
      
            > rxrpc_input_ack()
      	* APPLY ACK PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_input_ping_response()
      	  - Probably doesn't need any extra locking
      	  ! Need READ_ONCE() on call->ping_serial
      	  > rxrpc_input_check_for_lost_ack()
      	    - Takes call->lock to consult Tx buffer
      	  > rxrpc_peer_add_rtt()
      	    ! Needs to take a lock (peer->rtt_input_lock)
      	    ! Could perhaps manage with cmpxchg() and xadd() instead
      	> rxrpc_input_requested_ack
      	  - Consults Tx buffer
      	    ! Probably needs a lock
      	  > rxrpc_peer_add_rtt()
      	> rxrpc_propose_ack()
      	> rxrpc_input_ackinfo()
      	  - Changes call->tx_winsize
      	    ! Use cmpxchg to handle change
      	    ! Should perhaps track serial number
      	  - Uses peer->lock to record MTU specification changes
      	> rxrpc_proto_abort()
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      	> rxrpc_input_soft_acks()
      	- Consults the Tx buffer
      	> rxrpc_congestion_management()
      	  - Modifies the Tx annotations
      	  ! Needs call->input_lock()
      	  > rxrpc_queue_call()
      
            > rxrpc_input_abort()
      	* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_set_call_completion()
      	> rxrpc_notify_socket()
      
            > rxrpc_input_ackall()
      	* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      
          > rxrpc_reject_packet()
      
      There are some functions used by the above that queue the packet, after
      which the procedure is terminated:
      
       - rxrpc_post_packet_to_local()
         - local->event_queue is an sk_buff_head
         - local->processor is a work_struct
       - rxrpc_post_packet_to_conn()
         - conn->rx_queue is an sk_buff_head
         - conn->processor is a work_struct
       - rxrpc_reject_packet()
         - local->reject_queue is an sk_buff_head
         - local->processor is a work_struct
      
      And some that offload processing to process context:
      
       - rxrpc_notify_socket()
         - Uses RCU lock
         - Uses call->notify_lock to call call->notify_rx
         - Uses call->recvmsg_lock to queue recvmsg side
       - rxrpc_queue_call()
         - call->processor is a work_struct
       - rxrpc_propose_ACK()
         - Uses call->lock to wrap __rxrpc_propose_ACK()
      
      And a bunch that complete a call, all of which use call->state_lock to
      protect the call state:
      
       - rxrpc_call_completed()
       - rxrpc_set_call_completion()
       - rxrpc_abort_call()
       - rxrpc_proto_abort()
         - Also uses rxrpc_queue_call()
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c1e15b49
    • D
      rxrpc: Fix connection-level abort handling · 64753092
      David Howells 提交于
      Fix connection-level abort handling to cache the abort and error codes
      properly so that a new incoming call can be properly aborted if it races
      with the parent connection being aborted by another CPU.
      
      The abort_code and error parameters can then be dropped from
      rxrpc_abort_calls().
      
      Fixes: f5c17aae ("rxrpc: Calls should only have one terminal state")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      64753092
    • D
      rxrpc: Only take the rwind and mtu values from latest ACK · 298bc15b
      David Howells 提交于
      Move the out-of-order and duplicate ACK packet check to before the call to
      rxrpc_input_ackinfo() so that the receive window size and MTU size are only
      checked in the latest ACK packet and don't regress.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      298bc15b