  1. 05 Nov 2019, 1 commit
    • timekeeping/vsyscall: Update VDSO data unconditionally · 52338415
      Committed by Huacai Chen
      The update of the VDSO data depends on __arch_use_vsyscall() returning
      True. This is a leftover from the attempt to map the features of various
      architectures 1:1 into generic code.
      
      The usage of __arch_use_vsyscall() in the actual vsyscall implementations
      got dropped and replaced by the requirement for the architecture code to
      return U64_MAX if the global clocksource is not usable in the VDSO.
      
      But the __arch_use_vsyscall() check in the update code stayed, which causes
      the VDSO data to be stale or invalid when an architecture actually
      implements that function and returns False when the current clocksource is
      not usable in the VDSO.
      
      As a consequence the VDSO implementations of clock_getres(), time(),
      clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus
      information.
      
      Remove the __arch_use_vsyscall() check from the VDSO update function and
      update the VDSO data unconditionally.
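
      A rough sketch of the resulting generic update path (names follow
      kernel/time/vsyscall.c, but the body is abbreviated, so treat this as
      illustrative rather than the exact diff):

          void update_vsyscall(struct timekeeper *tk)
          {
                  struct vdso_data *vdata = __arch_get_k_vdso_data();

                  /* No __arch_use_vsyscall() guard anymore: the data pages
                   * are refreshed on every timekeeping update. An arch whose
                   * current clocksource is unusable in the VDSO signals that
                   * by returning U64_MAX from its counter read instead. */
                  vdso_write_begin(vdata);
                  update_vdso_data(vdata, tk);
                  vdso_write_end(vdata);

                  __arch_sync_vdso_data(vdata);
          }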
      
      [ tglx: Massaged changelog and removed the now useless implementations in
        	asm-generic/ARM64/MIPS ]
      
      Fixes: 44f57d78 ("timekeeping: Provide a generic update_vsyscall() implementation")
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com
  2. 01 Nov 2019, 1 commit
    • net: increase SOMAXCONN to 4096 · 19f92a03
      Committed by Eric Dumazet
      SOMAXCONN is the default value of /proc/sys/net/core/somaxconn.
      
      It was defined as 128 more than 20 years ago.
      
      Since it caps the listen() backlog values, the very small value has
      caused numerous problems over the years, and many people had
      to raise it on their hosts after being hit by problems.
      
      Google has been using 1024 for at least 15 years, and we increased
      this to 4096 after the TCP listener rework was completed, more than
      4 years ago. We got no complaints about this change breaking any
      legacy application.
      
      Many applications indeed set up a TCP listener with listen(fd, -1),
      meaning they let the system select the backlog.
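
      For illustration (not part of the patch), a minimal listener relying on
      that behavior; the port number is arbitrary, and the kernel silently
      caps the backlog to net.core.somaxconn, so listen(fd, -1) now yields
      4096 instead of 128:

          #include <netinet/in.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <unistd.h>

          int main(void)
          {
                  struct sockaddr_in addr;
                  int fd = socket(AF_INET, SOCK_STREAM, 0);

                  memset(&addr, 0, sizeof(addr));
                  addr.sin_family = AF_INET;
                  addr.sin_addr.s_addr = htonl(INADDR_ANY);
                  addr.sin_port = htons(8080);

                  if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                          perror("socket/bind");
                          return 1;
                  }

                  /* -1 wraps to a huge value; the kernel clamps it to
                   * net.core.somaxconn. */
                  if (listen(fd, -1) < 0) {
                          perror("listen");
                          return 1;
                  }

                  pause();        /* accept() loop elided */
                  return 0;
          }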
      
      Raising SOMAXCONN lowers the chance of the port being unavailable under
      even a small SYN flood attack, and reduces the possibility of side-channel
      vulnerabilities.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 31 Oct 2019, 5 commits
    • efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMIN · 359efcc2
      Committed by Javier Martinez Canillas
      The driver exposes EFI runtime services to user-space through an IOCTL
      interface, calling the EFI services function pointers directly without
      using the efivar API.
      
      Disallow access to the /dev/efi_test character device when the kernel is
      locked down, to prevent arbitrary user space from calling EFI runtime
      services.
      
      Also require CAP_SYS_ADMIN to open the chardev, to prevent unprivileged
      users from calling the EFI runtime services, instead of just relying on
      the chardev file mode bits for this.
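
      The resulting open handler looks roughly like this (a sketch; the
      lockdown hook name is taken from the lockdown LSM series this patch is
      part of):

          static int efi_test_open(struct inode *inode, struct file *file)
          {
                  int ret = security_locked_down(LOCKDOWN_EFI_TEST);

                  if (ret)
                          return ret;

                  /* Mode bits on the chardev are no longer the only barrier. */
                  if (!capable(CAP_SYS_ADMIN))
                          return -EACCES;

                  return 0;
          }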
      
      The main user of this driver is the fwts [0] tool that already checks if
      the effective user ID is 0 and fails otherwise. So this change shouldn't
      cause any regression to this tool.
      
      [0]: https://wiki.ubuntu.com/FirmwareTestSuite/Reference/uefivarinfo
      Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Laszlo Ersek <lersek@redhat.com>
      Acked-by: Matthew Garrett <mjg59@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-7-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86, efi: Never relocate kernel below lowest acceptable address · 220dd769
      Committed by Kairui Song
      Currently, the kernel fails to boot on some Hyper-V VMs when using EFI,
      and it's a potential issue on all x86 platforms.
      
      It's caused by broken kernel relocation on EFI systems when the three
      conditions below are met:
      
      1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR)
         by the loader.
      2. There isn't enough room to contain the kernel, starting from the
         default load address (eg. something else occupied part of the region).
      3. In the memmap provided by EFI firmware, there is a memory region
         that starts below LOAD_PHYSICAL_ADDR and is suitable for containing
         the kernel.
      
      The EFI stub will perform a kernel relocation when condition 1 is met.
      But due to condition 2, the EFI stub can't relocate the kernel to the
      preferred address, so it falls back to asking the EFI firmware to
      allocate the lowest usable memory region, gets the low region mentioned
      in condition 3, and relocates the kernel there.
      
      It's incorrect to relocate the kernel below LOAD_PHYSICAL_ADDR. This
      is the lowest acceptable kernel relocation address.
      
      The first thing that goes wrong is in arch/x86/boot/compressed/head_64.S.
      Kernel decompression will force the use of LOAD_PHYSICAL_ADDR as the
      output address if the kernel is located below it. Then the relocation
      before decompression, which moves the kernel to the end of the
      decompression buffer, will overwrite other memory regions, as there is
      not enough memory there.
      
      To fix it, just don't let the EFI stub relocate the kernel to any address
      lower than the lowest acceptable address.
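
      A hedged sketch of the stub-side change (efi_low_alloc_above() is the
      helper mentioned in the bracketed note below; the exact parameter list
      here is an assumption):

          /* Sketch: allocate the lowest usable region, but never below the
           * minimum address the decompressor can cope with. */
          status = efi_low_alloc_above(sys_table, alloc_size,
                                       CONFIG_PHYSICAL_ALIGN,
                                       &new_addr, LOAD_PHYSICAL_ADDR);
          if (status != EFI_SUCCESS)
                  return status;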
      
      [ ardb: introduce efi_low_alloc_above() to reduce the scope of the change ]
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-6-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • net: annotate lockless accesses to sk->sk_napi_id · ee8d153d
      Committed by Eric Dumazet
      We already annotated most accesses to sk->sk_napi_id.
      
      We missed sk_mark_napi_id() and sk_mark_napi_id_once(),
      which might be called without the socket lock held in the UDP stack.
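
      The annotation pattern, roughly as applied in include/net/busy_poll.h
      (a sketch; lockless readers pair this with READ_ONCE()):

          static inline void sk_mark_napi_id(struct sock *sk,
                                             const struct sk_buff *skb)
          {
          #ifdef CONFIG_NET_RX_BUSY_POLL
                  /* May run without the socket lock (UDP receive), so the
                   * plain store becomes a WRITE_ONCE() to document the race. */
                  WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
          #endif
          }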
      
      KCSAN reported:
      BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: e68b6e50 ("udp: enable busy polling for all sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate accesses to sk->sk_incoming_cpu · 7170a977
      Committed by Eric Dumazet
      This socket field can be read and written by concurrent CPUs.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
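
      Roughly, the annotated helper in include/net/sock.h becomes (sketch):

          static inline void sk_incoming_cpu_update(struct sock *sk)
          {
                  int cpu = raw_smp_processor_id();

                  /* The annotated read avoids dirtying the cache line when
                   * the value is already right; the annotated write documents
                   * the race with lockless readers on other CPUs. */
                  if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
                          WRITE_ONCE(sk->sk_incoming_cpu, cpu);
          }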
      
      KCSAN reported:
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • SUNRPC: Destroy the back channel when we destroy the host transport · 669996ad
      Committed by Trond Myklebust
      When we're destroying the host transport mechanism, we should ensure
      that we do not leak memory by failing to release any back channel
      slots that might still exist.
      Reported-by: Neil Brown <neilb@suse.de>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  4. 30 Oct 2019, 1 commit
  5. 29 Oct 2019, 2 commits
    • net: fix sk_page_frag() recursion from memory reclaim · 20eb4f29
      Committed by Tejun Heo
      sk_page_frag() optimizes skb_frag allocations by using the per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through the memory reclaim path.
      Please take a look at the following backtrace.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->task_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      could fit into current->task_frag.  Due to memory pressure,
      sk_wmem_schedule() called into the memory reclaim path, which called
      into xfs and then the IO issue path.  Because the filesystem in
      question is backed by nbd, control goes back into the tcp layer -
      back into tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking - which it does - it now thinks
      current->task_frag can be used again, although it was already being
      used in [0].
      
      After [2] used current->task_frag, the offset would be increased by
      the used amount.  When control returns to [0],
      current->task_frag's offset has been increased, and the previously
      calculated number of bytes may now overrun the end of the allocated
      memory, leading to silent memory corruption.
      
      Fix it by adding gfpflags_normal_context(), which tests sleepable &&
      !reclaim, and using it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
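
      The helper and its use in sk_page_frag() look roughly like this (a
      sketch of the fix):

          /* "Normal" context: the allocation may block (sleepable) and is
           * not itself driving memory reclaim (__GFP_MEMALLOC, as set by
           * nbd).  Only then does the caller exclusively own
           * current->task_frag. */
          static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
          {
                  return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
                          __GFP_DIRECT_RECLAIM;
          }

          static inline struct page_frag *sk_page_frag(struct sock *sk)
          {
                  if (gfpflags_normal_context(sk->sk_allocation))
                          return &current->task_frag;
                  return &sk->sk_frag;
          }
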
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add skb_queue_empty_lockless() · d7d16a89
      Committed by Eric Dumazet
      Some paths call skb_queue_empty() without holding
      the queue lock. We must use a barrier in order
      not to let the compiler do strange things, and to avoid
      KCSAN splats.
      
      Adding a barrier in skb_queue_empty() might be overkill,
      so I prefer adding a new helper to clearly identify
      points where the callers might be lockless. This might
      help us find real bugs.
      
      The corresponding WRITE_ONCE() should add zero cost
      for current compilers.
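
      The helper is essentially the existing test with an annotated read
      (sketch):

          static inline bool skb_queue_empty_lockless(const struct sk_buff_head *list)
          {
                  /* READ_ONCE() gives lockless callers a single sane load of
                   * ->next instead of whatever the compiler might invent. */
                  return READ_ONCE(list->next) == (const struct sk_buff *)list;
          }
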
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Oct 2019, 2 commits
  7. 26 Oct 2019, 2 commits
    • netns: fix GFP flags in rtnl_net_notifyid() · d4e4fdf9
      Committed by Guillaume Nault
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The latter also precludes the use
      of gfp_any(), as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as a parameter and propagate
      them through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static · 91e2e576
      Committed by Ben Dooks (Codethink)
      If CONFIG_NET_HWBM is not set, then these stub functions in
      <net/hwbm.h> should be declared static to avoid trying to
      export them from every driver that includes this header.
      
      Fixes the following sparse warnings:
      
      ./include/net/hwbm.h:24:6: warning: symbol 'hwbm_buf_free' was not declared. Should it be static?
      ./include/net/hwbm.h:25:5: warning: symbol 'hwbm_pool_refill' was not declared. Should it be static?
      ./include/net/hwbm.h:26:5: warning: symbol 'hwbm_pool_add' was not declared. Should it be static?
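
      The fixed header then follows the usual config-gated stub pattern (a
      sketch of the resulting <net/hwbm.h> shape):

          #ifdef CONFIG_NET_HWBM
          void hwbm_buf_free(struct hwbm_pool *bm_pool, void *buf);
          int hwbm_pool_refill(struct hwbm_pool *bm_pool, gfp_t gfp);
          int hwbm_pool_add(struct hwbm_pool *bm_pool, unsigned int buf_num);
          #else
          /* static inline: each includer gets a private no-op copy instead
           * of an undeclared global symbol that sparse complains about. */
          static inline void hwbm_buf_free(struct hwbm_pool *bm_pool, void *buf) {}
          static inline int hwbm_pool_refill(struct hwbm_pool *bm_pool, gfp_t gfp)
          { return 0; }
          static inline int hwbm_pool_add(struct hwbm_pool *bm_pool,
                                          unsigned int buf_num)
          { return 0; }
          #endif
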
      Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 25 Oct 2019, 7 commits
    • net: remove unnecessary variables and callback · f3b0a18b
      Committed by Taehee Yoo
      This patch removes the variables and the callback that are related to
      the nested device structure.
      Devices that can be nested have their own nest_level variable that
      represents the depth of nested devices.
      In the previous patch, new {lower/upper}_level variables were added,
      and they replace the old private nest_level variable.
      So, this patch removes all 'nest_level' variables.
      
      In order to avoid lockdep warnings, ->ndo_get_lock_subclass() was added
      to get the lockdep subclass value, which is actually the lower nested
      depth value.
      But now, these drivers use the dynamic lockdep key to avoid the lockdep
      warning instead of the subclass.
      So, this patch removes the ->ndo_get_lock_subclass() callback.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: add adjacent link to limit depth level · 0ce1822c
      Committed by Taehee Yoo
      The current vxlan code doesn't limit the number of nested devices.
      Nested devices are handled recursively, and this routine needs
      huge stack memory. So, unlimited nested devices could cause a
      stack overflow.
      
      In order to fix this issue, this patch adds adjacent links.
      The adjacent link APIs internally check the depth level.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
      	    dstport 4789
          for i in {1..100}
          do
      	    let A=$i-1
      	    ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
      		    dev vxlan$A dstport 4789
          done
          ip link del dummy0
      
      The top upper link is vxlan100 and the lowest link is vxlan0.
      When vxlan0 is deleted, the upper devices will be deleted recursively.
      This needs huge stack memory, so it causes a stack overflow.
      
      Splat looks like:
      [  229.628477] =============================================================================
      [  229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
      [  229.629785] -----------------------------------------------------------------------------
      [  229.629785]
      [  229.655439] ==================================================================
      [  229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
      [  229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
      [  229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
      [  229.655688]
      [  229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff  ........h.......
      [  229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff  8.......h.......
      [  229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f  30b.......`"....
      [  229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff  .`..............
      [  229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff  h........!......
      [  229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00  ................
      [  229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df  !..!............
      [  229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff  .`..............
      [  229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00                          ...A....
      [  229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
      [  ... ]
      
      Fixes: acaf4e70 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add ignore flag to netdev_adjacent structure · 32b6d34f
      Committed by Taehee Yoo
      In order to link an adjacent node, netdev_upper_dev_link() is used,
      and in order to unlink an adjacent node, netdev_upper_dev_unlink() is
      used. The unlink operation does not fail, but the link operation can.
      
      In order to exchange adjacent nodes, we should unlink the old adjacent
      node first, then link the new adjacent node.
      If the link operation fails, we should link the old adjacent node again.
      But this link operation can fail too.
      It eventually breaks the adjacent link relationship.
      
      This patch adds an ignore flag to the netdev_adjacent structure.
      If this flag is set, netdev_upper_dev_link() ignores the old adjacent
      node for a moment.
      
      This patch also adds new functions for other modules.
      netdev_adjacent_change_prepare()
      netdev_adjacent_change_commit()
      netdev_adjacent_change_abort()
      
      netdev_adjacent_change_prepare() inserts the new device into the
      adjacent list, but the new device is not allowed to be used immediately.
      If netdev_adjacent_change_prepare() fails, it internally rolls back the
      adjacent list, so that we don't need any other action.
      netdev_adjacent_change_commit() deletes the old device from the adjacent
      list and allows the new device to be used.
      netdev_adjacent_change_abort() rolls back the adjacent list.
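
      A hedged usage sketch of the three-step API from a caller swapping
      lower devices (do_device_swap() is a hypothetical placeholder for the
      caller's own work):

          err = netdev_adjacent_change_prepare(old_dev, new_dev, upper_dev,
                                               extack);
          if (err)
                  return err;     /* adjacency list already rolled back */

          err = do_device_swap(old_dev, new_dev);         /* hypothetical */
          if (err) {
                  netdev_adjacent_change_abort(old_dev, new_dev, upper_dev);
                  return err;
          }

          /* Drop the old device and make the new one usable. */
          netdev_adjacent_change_commit(old_dev, new_dev, upper_dev);
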
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: fix nested locking lockdep warning · 369f61be
      Committed by Taehee Yoo
      A team interface can be nested, and its lock variable can be nested too.
      But this lock uses a static lockdep key, and there is no nested-locking
      handling code such as mutex_lock_nested() and so on,
      so lockdep warns about a circular locking scenario that
      couldn't actually happen.
      In order to fix this, this patch makes the team module use a dynamic
      lockdep key instead of a static key.
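
      The dynamic-key pattern looks roughly like this (a sketch; each team
      instance registers its own key at init and releases it on destroy):

          static int team_init(struct net_device *dev)
          {
                  struct team *team = netdev_priv(dev);

                  /* Per-instance key: nested team devices no longer share
                   * one static class, so lockdep can tell the mutexes apart. */
                  lockdep_register_key(&team->team_lock_key);
                  __mutex_init(&team->lock, "team->team_lock_key",
                               &team->team_lock_key);
                  return 0;
          }

          static void team_uninit(struct net_device *dev)
          {
                  struct team *team = netdev_priv(dev);

                  mutex_destroy(&team->lock);
                  lockdep_unregister_key(&team->team_lock_key);
          }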
      
      Test commands:
          ip link add team0 type team
          ip link add team1 type team
          ip link set team0 master team1
          ip link set team0 nomaster
          ip link set team1 master team0
          ip link set team1 nomaster
      
      Splat looks like:
      [   40.364352] WARNING: possible recursive locking detected
      [   40.364964] 5.4.0-rc3+ #96 Not tainted
      [   40.365405] --------------------------------------------
      [   40.365973] ip/750 is trying to acquire lock:
      [   40.366542] ffff888060b34c40 (&team->lock){+.+.}, at: team_set_mac_address+0x151/0x290 [team]
      [   40.367689]
      	       but task is already holding lock:
      [   40.368729] ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.370280]
      	       other info that might help us debug this:
      [   40.371159]  Possible unsafe locking scenario:
      
      [   40.371942]        CPU0
      [   40.372338]        ----
      [   40.372673]   lock(&team->lock);
      [   40.373115]   lock(&team->lock);
      [   40.373549]
      	       *** DEADLOCK ***
      
      [   40.374432]  May be due to missing lock nesting notation
      
      [   40.375338] 2 locks held by ip/750:
      [   40.375851]  #0: ffffffffabcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [   40.376927]  #1: ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.377989]
      	       stack backtrace:
      [   40.378650] CPU: 0 PID: 750 Comm: ip Not tainted 5.4.0-rc3+ #96
      [   40.379368] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [   40.380574] Call Trace:
      [   40.381208]  dump_stack+0x7c/0xbb
      [   40.381959]  __lock_acquire+0x269d/0x3de0
      [   40.382817]  ? register_lock_class+0x14d0/0x14d0
      [   40.383784]  ? check_chain_key+0x236/0x5d0
      [   40.384518]  lock_acquire+0x164/0x3b0
      [   40.385074]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.385805]  __mutex_lock+0x14d/0x14c0
      [   40.386371]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387038]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387632]  ? mutex_lock_io_nested+0x1380/0x1380
      [   40.388245]  ? team_del_slave+0x60/0x60 [team]
      [   40.388752]  ? rcu_read_lock_sched_held+0x90/0xc0
      [   40.389304]  ? rcu_read_lock_bh_held+0xa0/0xa0
      [   40.389819]  ? lock_acquire+0x164/0x3b0
      [   40.390285]  ? lockdep_rtnl_is_held+0x16/0x20
      [   40.390797]  ? team_port_get_rtnl+0x90/0xe0 [team]
      [   40.391353]  ? __module_text_address+0x13/0x140
      [   40.391886]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.392547]  team_set_mac_address+0x151/0x290 [team]
      [   40.393111]  dev_set_mac_address+0x1f0/0x3f0
      [ ... ]
      
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: use dynamic lockdep key instead of subclass · 089bca2c
      Committed by Taehee Yoo
      All bonding devices have the same lockdep key, and the subclass is
      initialized with nest_level.
      But the actual nest_level value can change when a lower device is
      attached, and at that moment the subclass should be updated, which
      seems to be unsafe.
      So this patch makes bonding use a dynamic lockdep key instead of the
      subclass.
      
      Test commands:
          ip link add bond0 type bond
      
          for i in {1..5}
          do
      	    let A=$i-1
      	    ip link add bond$i type bond
      	    ip link set bond$i master bond$A
          done
          ip link set bond5 master bond0
      
      Splat looks like:
      [  307.992912] WARNING: possible recursive locking detected
      [  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
      [  307.994367] --------------------------------------------
      [  307.995092] ip/761 is trying to acquire lock:
      [  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.997045]
      	       but task is already holding lock:
      [  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.999215]
      	       other info that might help us debug this:
      [  308.000251]  Possible unsafe locking scenario:
      
      [  308.001137]        CPU0
      [  308.001533]        ----
      [  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.003302]
      		*** DEADLOCK ***
      
      [  308.004310]  May be due to missing lock nesting notation
      
      [  308.005319] 3 locks held by ip/761:
      [  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
      [  308.009422]
      	       stack backtrace:
      [  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
      [  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  308.012179] Call Trace:
      [  308.012601]  dump_stack+0x7c/0xbb
      [  308.013089]  __lock_acquire+0x269d/0x3de0
      [  308.013669]  ? register_lock_class+0x14d0/0x14d0
      [  308.014318]  lock_acquire+0x164/0x3b0
      [  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.015520]  _raw_spin_lock_nested+0x2e/0x60
      [  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
      [  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
      [  308.019710]  ? rcu_read_lock_held+0x90/0xa0
      [  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
      [  308.021953]  dev_get_stats+0x1ec/0x270
      [  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]
      
      Fixes: d3fff6c4 ("net: add netdev_lockdep_set_classes() helper")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add generic lockdep keys · ab92d68f
      Committed by Taehee Yoo
      Some interface types can be nested
      (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc.).
      These interface types should set a lockdep class, because without a
      lockdep class key, lockdep always warns about nonexistent circular
      locking.
      
      In the current code, these interfaces have their own lockdep class keys
      and manage them themselves, so there is a lot of duplicated code around
      drivers/net/ and net/.
      This patch adds new generic lockdep keys and some helper functions for
      them.
      
      This patch makes the changes below.
      a) Add lockdep class keys in struct net_device
         - qdisc_running, xmit, addr_list, qdisc_busylock
         - these keys are used as dynamic lockdep keys.
      b) When a net_device is being allocated, the lockdep keys are registered.
         - alloc_netdev_mqs()
      c) When a net_device is being freed, the lockdep keys are unregistered.
         - free_netdev()
      d) Add generic lockdep key helper functions
         - netdev_register_lockdep_key()
         - netdev_unregister_lockdep_key()
         - netdev_update_lockdep_key()
      e) Remove unnecessary generic lockdep macros and functions
      f) Remove unnecessary lockdep code from each interface.
      
      After this patch, each interface module doesn't need to maintain
      its own lockdep keys.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: limit nested device depth · 5343da4c
      Committed by Taehee Yoo
      The current code doesn't limit the number of nested devices.
      Nested devices are handled recursively, and this needs huge stack
      memory. So, unlimited nested devices could cause a stack overflow.
      
      This patch adds upper_level and lower_level; they are common variables
      and represent the maximum lower/upper depth.
      When an upper/lower device is attached or detached,
      {lower/upper}_level are updated, and if the maximum depth is bigger
      than 8, the attach routine fails and returns -EMLINK.
      
      In addition, this patch converts the recursive routines of
      netdev_walk_all_{lower/upper} to iterative routines.
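
      The guard at link time is essentially this (a sketch of the new check
      in __netdev_upper_dev_link(), with MAX_NEST_DEV being the limit of 8):

          /* Reject a link that would push the combined nesting past the
           * limit, instead of recursing into an unbounded device graph
           * later on. */
          if ((dev->lower_level + upper_dev->upper_level) > MAX_NEST_DEV)
                  return -EMLINK;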
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add link dummy0 name vlan1 type vlan id 1
          ip link set vlan1 up
      
          for i in {2..55}
          do
      	    let A=$i-1
      
      	    ip link add vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      Splat looks like:
      [  155.513226][  T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
      [  155.514162][  T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
      [  155.515048][  T908]
      [  155.515333][  T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  155.516147][  T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  155.517233][  T908] Call Trace:
      [  155.517627][  T908]
      [  155.517918][  T908] Allocated by task 0:
      [  155.518412][  T908] (stack is not available)
      [  155.518955][  T908]
      [  155.519228][  T908] Freed by task 0:
      [  155.519885][  T908] (stack is not available)
      [  155.520452][  T908]
      [  155.520729][  T908] The buggy address belongs to the object at ffff8880608a6ac0
      [  155.520729][  T908]  which belongs to the cache names_cache of size 4096
      [  155.522387][  T908] The buggy address is located 512 bytes inside of
      [  155.522387][  T908]  4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
      [  155.523920][  T908] The buggy address belongs to the page:
      [  155.524552][  T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
      [  155.525836][  T908] flags: 0x100000000010200(slab|head)
      [  155.526445][  T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
      [  155.527424][  T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  155.528429][  T908] page dumped because: kasan: bad access detected
      [  155.529158][  T908]
      [  155.529410][  T908] Memory state around the buggy address:
      [  155.530060][  T908]  ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.530971][  T908]  ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
      [  155.531889][  T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.532806][  T908]                                            ^
      [  155.533509][  T908]  ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
      [  155.534436][  T908]  ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
      [ ... ]
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 24 Oct 2019, 2 commits
    • ipvs: move old_secure_tcp into struct netns_ipvs · c24b75e0
      Committed by Eric Dumazet
      syzbot reported the following issue:
      
      BUG: KCSAN: data-race in update_defense_level / update_defense_level
      
      read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
       update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
       update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events defense_work_handler
      
      Indeed, old_secure_tcp is currently a static variable, while it
      needs to be a per-netns variable.
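
      The shape of the fix (a sketch): the file-scope static moves next to
      the other defense-level state in struct netns_ipvs, and
      update_defense_level() reads and writes it through its ipvs pointer:

          struct netns_ipvs {
                  /* ... other per-netns ipvs state ... */
                  int     sysctl_secure_tcp;
                  int     old_secure_tcp;   /* was: static int old_secure_tcp */
          };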
      
      Fixes: a0840e2e ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • net/flow_dissector: switch to siphash · 55667441
      Committed by Eric Dumazet
      UDP IPv6 packet auto flowlabels use a 32-bit secret
      (static u32 hashrnd in net/core/flow_dissector.c) and
      apply jhash() over fields known by the receivers.
      
      Attackers can easily infer the 32-bit secret and use this information
      to identify a device and/or user, since this 32-bit secret is only
      set at boot time.
      
      Really, using jhash() to generate cookies sent on the wire
      is a serious security concern.
      
      Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
      a dead end. Trying to periodically change the secret (like in sch_sfq.c)
      could change paths taken in the network for long lived flows.
      
      Let's switch to siphash, as we did in commit df453700
      ("inet: switch IP ID generator to siphash")
      
      Using a cryptographically strong pseudo random function will solve this
      privacy issue and more generally remove other weak points in the stack.
      
      Packet schedulers using skb_get_hash_perturb() benefit from this change.
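
      A hedged sketch of the core change in net/core/flow_dissector.c: the
      32-bit jhash seed becomes a 128-bit siphash key, and the dissected flow
      keys are hashed with siphash():

          static siphash_key_t hashrnd __read_mostly; /* was: static u32 hashrnd */

          static inline u32 __flow_hash_from_keys(struct flow_keys *keys,
                                                  const siphash_key_t *keyval)
          {
                  __flow_hash_consistentify(keys);
                  /* siphash over the same key material jhash used to cover */
                  return siphash(flow_keys_hash_start(keys),
                                 flow_keys_hash_length(keys), keyval);
          }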
      
      Fixes: b5677416 ("ipv6: Enable auto flow labels by default")
      Fixes: 42240901 ("ipv6: Implement different admin modes for automatic flow labels")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Fixes: cb1ce2ef ("ipv6: Implement automatic flow label generation on transmit")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Berger <jonathann1@walla.com>
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Reported-by: Benny Pinkas <benny@pinkas.net>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 23 Oct 2019, 4 commits
    • fuse: Add changelog entries for protocols 7.1 - 7.8 · 9de55a37
      Committed by Alan Somers
      Retroactively add changelog entries for FUSE protocols 7.1 through 7.8.
      Signed-off-by: Alan Somers <asomers@FreeBSD.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • dynamic_debug: provide dynamic_hex_dump stub · 011c7289
      Committed by Arnd Bergmann
      The ionic driver started using dynamic_hex_dump(), but
      that is not always defined:
      
      drivers/net/ethernet/pensando/ionic/ionic_main.c:229:2: error: implicit declaration of function 'dynamic_hex_dump' [-Werror,-Wimplicit-function-declaration]
      
      Add a dummy implementation to use when CONFIG_DYNAMIC_DEBUG
      is disabled, printing nothing.
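
      The stub mirrors the other !CONFIG_DYNAMIC_DEBUG fallbacks in
      <linux/dynamic_debug.h> (a sketch):

          /* With dynamic debug compiled out there is no call-site control,
           * so the dump degrades to a no-op rather than a build failure. */
          static inline void dynamic_hex_dump(const char *prefix_str,
                                              int prefix_type, int rowsize,
                                              int groupsize, const void *buf,
                                              size_t len, bool ascii)
          {
          }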
      
      Fixes: 938962d5 ("ionic: Add adminq action")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • bpf: Fix use after free in subprog's jited symbol removal · cd7455f1
      Committed by Daniel Borkmann
      syzkaller managed to trigger the following crash:
      
        [...]
        BUG: unable to handle page fault for address: ffffc90001923030
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD aa551067 P4D aa551067 PUD aa552067 PMD a572b067 PTE 80000000a1173163
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 0 PID: 7982 Comm: syz-executor912 Not tainted 5.4.0-rc3+ #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:bpf_jit_binary_hdr include/linux/filter.h:787 [inline]
        RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:531 [inline]
        RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline]
        RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
        RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
        RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline]
        RIP: 0010:is_bpf_text_address+0x184/0x3b0 kernel/bpf/core.c:709
        [...]
        Call Trace:
         kernel_text_address kernel/extable.c:147 [inline]
         __kernel_text_address+0x9a/0x110 kernel/extable.c:102
         unwind_get_return_address+0x4c/0x90 arch/x86/kernel/unwind_frame.c:19
         arch_stack_walk+0x98/0xe0 arch/x86/kernel/stacktrace.c:26
         stack_trace_save+0xb6/0x150 kernel/stacktrace.c:123
         save_stack mm/kasan/common.c:69 [inline]
         set_track mm/kasan/common.c:77 [inline]
         __kasan_kmalloc+0x11c/0x1b0 mm/kasan/common.c:510
         kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:518
         slab_post_alloc_hook mm/slab.h:584 [inline]
         slab_alloc mm/slab.c:3319 [inline]
         kmem_cache_alloc+0x1f5/0x2e0 mm/slab.c:3483
         getname_flags+0xba/0x640 fs/namei.c:138
         getname+0x19/0x20 fs/namei.c:209
         do_sys_open+0x261/0x560 fs/open.c:1091
         __do_sys_open fs/open.c:1115 [inline]
         __se_sys_open fs/open.c:1110 [inline]
         __x64_sys_open+0x87/0x90 fs/open.c:1110
         do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [...]
      
      After further debugging, it turns out that we walk kallsyms while in
      parallel we tear down a BPF program which contains subprograms that have
      been JITed, though the program itself has not been fully exposed and
      eventually bails out with an error.
      
      The bpf_prog_kallsyms_del_subprogs() in bpf_prog_load()'s error path
      removes the symbols; however, bpf_prog_free() tears down the JIT memory
      too early via scheduled work. Instead, it needs to properly respect the
      RCU grace period, as the kallsyms walk for BPF is under RCU.
      
      Fix it by refactoring __bpf_prog_put()'s teardown and reusing it in our
      error path, where we defer final destruction when we have subprogs in
      the program.
      
      Fixes: 7d1982b4 ("bpf: fix panic in prog load calls cleanup")
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/bpf/55f6367324c2d7e9583fa9ccf5385dcbba0d7a6e.1571752452.git.daniel@iogearbox.net
    • RDMA/uverbs: Prevent potential underflow · a9018adf
      Committed by Dan Carpenter
      The issue is in drivers/infiniband/core/uverbs_std_types_cq.c in the
      UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE) function.  We check that:
      
              if (attr.comp_vector >= attrs->ufile->device->num_comp_vectors) {
      
      But we don't check whether "attr.comp_vector" is negative.  It could
      potentially lead to an array underflow.  My concern would be where
      cq->vector is used in the create_cq() function from the cxgb4 driver.
      
      And really, "attr.comp_vector" appears as a u32 to user space, so that's
      the right type to use.
      
      Fixes: 9ee79fce ("IB/core: Add completion queue (cq) object actions")
      Link: https://lore.kernel.org/r/20191011133419.GA22905@mwanda
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  11. 22 Oct 2019, 1 commit
  12. 21 Oct 2019, 3 commits
  13. 20 Oct 2019, 1 commit
    • net: reorder 'struct net' fields to avoid false sharing · 2a06b898
      Committed by Eric Dumazet
      The Intel test robot reported a ~7% regression on TCP_CRR tests
      that they bisected to the cited commit.
      
      Indeed, every time a new TCP socket is created or deleted,
      the atomic counter net->count is touched (via get_net(net)
      and put_net(net) calls).
      
      So CPUs might have to reload a contended cache line in
      net_hash_mix(net) calls.
      
      We need to reorder 'struct net' fields to move @hash_mix
      into a read-mostly cache line.
      
      We move into the first cache line the fields that can be
      dirtied often.
      
      We probably will have to address in a follow-up patch
      the __randomize_layout that was added in linux-4.13,
      since this might break our placement choices.
      
      Fixes: 355b9855 ("netns: provide pure entropy for net_hash_mix()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 18 Oct 2019, 2 commits
  15. 17 Oct 2019, 2 commits
    • btrfs: tracepoints: Fix bad entry members of qgroup events · 1b2442b4
      Committed by Qu Wenruo
      [BUG]
      For btrfs:qgroup_meta_reserve event, the trace event can output garbage:
      
        qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
        qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=0x258792 diff=2
      
      The @type can be completely garbage, as the DATA type is not possible
      for the trace_qgroup_meta_reserve() trace event.
      
      [CAUSE]
      There are several problems related to the qgroup trace events (a sketch
      of the fix follows this list):
      - Unassigned entry member
        Member entry::type of trace_qgroup_update_reserve() and
        trace_qgroup_meta_reserve() is not assigned
      
      - Redundant entry member
        Member entry::type is completely useless in
        trace_qgroup_meta_convert()
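
      The fix is the obvious one in the events' assign blocks (a sketch; the
      surrounding TP_fast_assign_btrfs() macro is from the btrfs tracepoint
      headers):

          TP_fast_assign_btrfs(root->fs_info,
                  __entry->refroot = root->root_key.objectid;
                  __entry->type    = type;    /* previously never assigned */
                  __entry->diff    = diff;
          )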
      
      Fixes: 4ee0d883 ("btrfs: qgroup: Update trace events for metadata reservation")
      CC: stable@vger.kernel.org # 4.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled · 19c95f26
      Committed by Julien Thierry
      Preempting from IRQ-return means that the task has its PSTATE saved
      on the stack, which will get restored when the task is resumed and does
      the actual IRQ return.
      
      However, enabling some CPU features requires modifying the PSTATE. This
      means that, if a task was scheduled out during an IRQ-return before all
      CPU features are enabled, the task might restore a PSTATE that does not
      include the feature enablement changes once scheduled back in.
      
      * Task 1:
      
      PAN == 0 ---|                          |---------------
                  |                          |<- return from IRQ, PSTATE.PAN = 0
                  | <- IRQ                   |
                  +--------+ <- preempt()  +--
                                           ^
                                           |
                                           reschedule Task 1, PSTATE.PAN == 1
      * Init:
              --------------------+------------------------
                                  ^
                                  |
                                  enable_cpu_features
                                  set PSTATE.PAN on all CPUs
      
      Worse than this, since PSTATE is untouched when task switching is done,
      a task missing the new bits in PSTATE might affect another task, if both
      do direct calls to schedule() (outside of IRQ/exception contexts).
      
      Fix this by preventing preemption on IRQ-return until features are
      enabled on all CPUs.
      
      This way the only PSTATE values that are saved on the stack are from
      synchronous exceptions. These are expected to be fatal this early; the
      exception is BRK for WARN_ON(), but as this uses do_debug_exception(),
      which keeps IRQs masked, it shouldn't call schedule().
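
      The C side of the fix is roughly the following (a sketch; in this
      kernel the "all cpufeatures enabled" condition is the
      arm64_const_caps_ready static key):

          asmlinkage void __sched arm64_preempt_schedule_irq(void)
          {
                  lockdep_assert_irqs_disabled();

                  /* Preempting from IRQ leaves a copy of PSTATE on the
                   * stack; cpufeature enablement may modify PSTATE, and
                   * resuming such a preempted task would undo those changes.
                   * So skip preemption from IRQ until features are enabled
                   * on all CPUs. */
                  if (static_branch_likely(&arm64_const_caps_ready))
                          preempt_schedule_irq();
          }
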
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      [james: Replaced a really cool hack with an even simpler static key in C.
       Expanded commit message with Julien's cover-letter ascii art]
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
  16. 16 Oct 2019, 1 commit
    • net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions · fa4e0f88
      Committed by Davide Caratti
      the following script:
      
       # tc qdisc add dev eth0 clsact
       # tc filter add dev eth0 egress protocol ip matchall \
       > action mpls push protocol mpls_uc label 0x355aa bos 1
      
      causes corruption of all IP packets transmitted by eth0. On TC egress, we
      can't rely on the value of skb->mac_len, because it's 0 and an MPLS 'push'
      operation will result in an overwrite of the first 4 octets of the packet's
      L2 header (e.g. the Destination Address if eth0 is an Ethernet device); the
      same error pattern is also present in the MPLS 'pop' operation. Fix this
      error in the act_mpls data plane, computing 'mac_len' as the difference
      between the network header and the mac header (when not at TC ingress),
      and use it in the MPLS 'push'/'pop' core functions.
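
      The data-plane computation is roughly (a sketch of the act_mpls side;
      mac_len is then handed to skb_mpls_push()/skb_mpls_pop()):

          unsigned int mac_len;

          /* At egress skb->mac_len is 0, so derive the real L2 header
           * length from the header offsets; at ingress skb->mac_len is
           * valid as-is. */
          if (skb_at_tc_ingress(skb))
                  mac_len = skb->mac_len;
          else
                  mac_len = skb_network_header(skb) - skb_mac_header(skb);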
      
      v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len'
          in skb_mpls_pop(), reported by kbuild test robot
      
      CC: Lorenzo Bianconi <lorenzo@kernel.org>
      Fixes: 2a2ea508 ("net: sched: add mpls manipulation actions to TC")
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Acked-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 15 Oct 2019, 3 commits