1. 29 November 2022, 1 commit
• iomap: write iomap validity checks · d7b64041
Committed by Dave Chinner
A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is that partial folio
writes can be flushed to disk while a new racing write maps the same
folio and fills the rest of it:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
When the new write finally gets to its last partial folio, it does
not find it in the page cache, so it instantiates a new folio, sees
that the iomap is unwritten, and zeroes the part of the folio it has
no data for. This overwrites the data on disk that was originally
written.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do this
check after the folio is locked so that no state change can occur
while we are waiting for the folio lock.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
The iomap needs to grow some storage for the validity cookie that
the filesystem provides so it can travel with the iomap. XFS, in
particular, also needs to know some more information about what the
iomap maps to (attribute extents rather than file data extents) so
that the validity cookie can cover all the types of iomaps we might
need to validate.
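
As a rough sketch of the resulting pattern (fs_iomap_valid() is a
stand-in name and the exact return convention is an assumption, not
the merged API): the filesystem stamps an opaque cookie into the
iomap when it builds the mapping, and the write path re-checks it
only once the folio lock is held:

    /*
     * Illustrative sketch only: fs_iomap_valid() stands in for the
     * filesystem's check against the cookie stored in the iomap.
     */
    static int write_begin_one_folio(struct iomap_iter *iter, struct folio *folio)
    {
            folio_lock(folio);

            /*
             * Only check after the folio lock is held, so no extent state
             * change can slip in while we waited for the lock.
             */
            if (!fs_iomap_valid(iter->inode, &iter->iomap)) {
                    folio_unlock(folio);
                    return -ESTALE; /* have the iterator remap the rest */
            }

            /* ... copy data into the locked folio ... */
            return 0;
    }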
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2. 23 November 2022, 1 commit
• xfs,iomap: move delalloc punching to iomap · 9c7babf9
Committed by Dave Chinner
Move the delalloc punching into iomap because that's what Christoph
wants for this error handling path that only XFS uses.

It requires a new iomap export for handling errors over delalloc
ranges. This is basically the XFS code as it stands, but even though
Christoph wants this as iomap functionality, we still have to call it
from the filesystem-specific ->iomap_end callback, and call into the
iomap code with yet another filesystem-specific callback to punch the
delalloc extents within the defined ranges.
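
Roughly, the arrangement looks like the sketch below (the exported
helper's exact signature is an assumption here, and
xfs_buffered_write_delalloc_punch() stands in for the
filesystem-specific punch callback):

    /* Sketch only: signatures are illustrative, not the final export. */
    static int xfs_buffered_write_delalloc_punch(struct inode *inode,
                                                 loff_t offset, loff_t length)
    {
            /* filesystem-specific: remove delalloc blocks over this byte range */
            return 0;
    }

    static int xfs_buffered_write_iomap_end(struct inode *inode, loff_t offset,
                                            loff_t length, ssize_t written,
                                            unsigned int flags, struct iomap *iomap)
    {
            if (iomap->type != IOMAP_DELALLOC)
                    return 0;

            /*
             * Let the generic iomap code work out which part of the delalloc
             * range a failed or short write left unused, then call back into
             * the filesystem to punch just that part out.
             */
            return iomap_file_buffered_write_punch_delalloc(inode, iomap, offset,
                            length, written, xfs_buffered_write_delalloc_punch);
    }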
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
3. 02 November 2022, 1 commit
4. 01 November 2022, 1 commit
5. 29 October 2022, 5 commits
6. 28 October 2022, 1 commit
• net/mlx5: Fix possible use-after-free in async command interface · bacd22df
Committed by Tariq Toukan
mlx5_cmd_cleanup_async_ctx should return only after all its callback
handlers have completed. Before this patch, the race below between
mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
could lead to a use-after-free:
      
      1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
         elevated by 1, a single inflight callback).
      2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
      3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
         is about to call wake_up().
      4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
         immediately as the condition (num_inflight == 0) holds.
      5. mlx5_cmd_cleanup_async_ctx returns.
      6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
         object.
      7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
         object.
      
      Fix it by syncing using a completion object. Mark it completed when
      num_inflight reaches 0.
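
In outline (a simplified sketch following the description above, not
the driver code verbatim):

    struct mlx5_async_ctx {
            atomic_t          num_inflight;   /* elevated by 1 for the ctx itself */
            struct completion inflight_done;
            /* ... */
    };

    /* init: atomic_set(&ctx->num_inflight, 1); init_completion(&ctx->inflight_done); */

    static void mlx5_cmd_exec_cb_handler(struct mlx5_async_ctx *ctx)
    {
            /* (other arguments omitted) ... run the user's callback ... */
            if (atomic_dec_and_test(&ctx->num_inflight))
                    complete(&ctx->inflight_done);  /* the last handler signals */
    }

    void mlx5_cmd_cleanup_async_ctx(struct mlx5_async_ctx *ctx)
    {
            /* Drop the context's own reference, then wait for all handlers. */
            if (!atomic_dec_and_test(&ctx->num_inflight))
                    wait_for_completion(&ctx->inflight_done);
    }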
      
      Trace:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
      Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0
      
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x57/0x7d
       print_report.cold+0x2d5/0x684
       ? do_raw_spin_lock+0x23d/0x270
       kasan_report+0xb1/0x1a0
       ? do_raw_spin_lock+0x23d/0x270
       do_raw_spin_lock+0x23d/0x270
       ? rwlock_bug.part.0+0x90/0x90
       ? __delete_object+0xb8/0x100
       ? lock_downgrade+0x6e0/0x6e0
       _raw_spin_lock_irqsave+0x43/0x60
       ? __wake_up_common_lock+0xb9/0x140
       __wake_up_common_lock+0xb9/0x140
       ? __wake_up_common+0x650/0x650
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kasan_set_track+0x21/0x30
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kfree+0x1ba/0x520
       ? do_raw_spin_unlock+0x54/0x220
       mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core]
       ? dump_command+0xcc0/0xcc0 [mlx5_core]
       ? lockdep_hardirqs_on_prepare+0x400/0x400
       ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       ? irq_release+0x140/0x140 [mlx5_core]
       irq_int_handler+0x19/0x30 [mlx5_core]
       __handle_irq_event_percpu+0x1f2/0x620
       handle_irq_event+0xb2/0x1d0
       handle_edge_irq+0x21e/0xb00
       __common_interrupt+0x79/0x1a0
       common_interrupt+0x78/0xa0
       </IRQ>
       <TASK>
       asm_common_interrupt+0x22/0x40
      RIP: 0010:default_idle+0x42/0x60
      Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00
      RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242
      RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110
      RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc
      RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3
      R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005
      R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000
       ? default_idle_call+0xcc/0x450
       default_idle_call+0xec/0x450
       do_idle+0x394/0x450
       ? arch_cpu_idle_exit+0x40/0x40
       ? do_idle+0x17/0x450
       cpu_startup_entry+0x19/0x20
       start_secondary+0x221/0x2b0
       ? set_cpu_sibling_map+0x2070/0x2070
       secondary_startup_64_no_verify+0xcd/0xdb
       </TASK>
      
      Allocated by task 49502:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       kvmalloc_node+0x48/0xe0
       mlx5e_bulk_async_init+0x35/0x110 [mlx5_core]
       mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Freed by task 49502:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_set_free_info+0x20/0x30
       ____kasan_slab_free+0x11d/0x1b0
       kfree+0x1ba/0x520
       mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: e355477e ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7. 27 October 2022, 3 commits
8. 26 October 2022, 1 commit
9. 25 October 2022, 4 commits
10. 24 October 2022, 2 commits
11. 22 October 2022, 2 commits
12. 21 October 2022, 4 commits
13. 20 October 2022, 5 commits
14. 19 October 2022, 2 commits
15. 18 October 2022, 1 commit
• udp: Update reuse->has_conns under reuseport_lock. · 69421bf9
Committed by Kuniyuki Iwashima
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1.  Otherwise, the kernel
could wrongly select an unconnected socket for packets sent to the
connected socket.
      
However, the current way of setting has_conns is illegal and can
trigger that problem.  reuseport_has_conns() changes has_conns under
rcu_read_lock(), which effectively upgrades the RCU reader to an
updater.  The update must then be done under the updater's lock,
reuseport_lock, but it currently is not.
      
      For this reason, there is a race below where we fail to set has_conns
      resulting in the wrong socket selection.  To avoid the race, let's split
      the reader and updater with proper locking.
      
       cpu1                               cpu2
      +----+                             +----+
      
      __ip[46]_datagram_connect()        reuseport_grow()
      .                                  .
      |- reuseport_has_conns(sk, true)   |- more_reuse = __reuseport_alloc(more_socks_size)
      |  .                               |
      |  |- rcu_read_lock()
      |  |- reuse = rcu_dereference(sk->sk_reuseport_cb)
      |  |
      |  |                               |  /* reuse->has_conns == 0 here */
      |  |                               |- more_reuse->has_conns = reuse->has_conns
      |  |- reuse->has_conns = 1         |  /* more_reuse->has_conns SHOULD BE 1 HERE */
      |  |                               |
      |  |                               |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
      |  |                               |                     more_reuse)
      |  `- rcu_read_unlock()            `- kfree_rcu(reuse, rcu)
      |
      |- sk->sk_state = TCP_ESTABLISHED
      
      Note the likely(reuse) in reuseport_has_conns_set() is always true,
      but we put the test there for ease of review.  [0]
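
The updater therefore takes reuseport_lock before touching has_conns;
roughly (a sketch matching the description above):

    void reuseport_has_conns_set(struct sock *sk)
    {
            struct sock_reuseport *reuse;

            if (!rcu_access_pointer(sk->sk_reuseport_cb))
                    return;

            spin_lock_bh(&reuseport_lock);
            reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
                                              lockdep_is_held(&reuseport_lock));
            /* always true in practice; kept for ease of review (see note above) */
            if (likely(reuse))
                    reuse->has_conns = 1;
            spin_unlock_bh(&reuseport_lock);
    }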
      
      For the record, usually, sk_reuseport_cb is changed under lock_sock().
      The only exception is reuseport_grow() & TCP reqsk migration case.
      
        1) shutdown() TCP listener, which is moved into the latter part of
           reuse->socks[] to migrate reqsk.
      
        2) New listen() overflows reuse->socks[] and call reuseport_grow().
      
        3) reuse->max_socks overflows u16 with the new listener.
      
  4) reuseport_grow() pops the old shutdown()ed listener from the array
     and updates its sk->sk_reuseport_cb to NULL without lock_sock().
      
A shutdown()ed TCP socket's sk->sk_reuseport_cb can thus be changed
without lock_sock(), but reuseport_has_conns_set() is called only for UDP
under lock_sock(), so likely(reuse) can never be false in
reuseport_has_conns_set().
      
      [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
      
      Fixes: acdcecc6 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16. 17 October 2022, 4 commits
• perf: Fix missing SIGTRAPs · ca6c2132
Committed by Peter Zijlstra
      Marco reported:
      
Due to the implementation of how SIGTRAPs are delivered if
perf_event_attr::sigtrap is set, we've noticed 3 issues:
      
        1. Missing SIGTRAP due to a race with event_sched_out() (more
           details below).
      
        2. Hardware PMU events being disabled due to returning 1 from
           perf_event_overflow(). The only way to re-enable the event is
           for user space to first "properly" disable the event and then
           re-enable it.
      
        3. The inability to automatically disable an event after a
           specified number of overflows via PERF_EVENT_IOC_REFRESH.
      
      The worst of the 3 issues is problem (1), which occurs when a
      pending_disable is "consumed" by a racing event_sched_out(), observed
      as follows:
      
      		CPU0			|	CPU1
      	--------------------------------+---------------------------
      	__perf_event_overflow()		|
      	 perf_event_disable_inatomic()	|
      	  pending_disable = CPU0	| ...
      					| _perf_event_enable()
      					|  event_function_call()
      					|   task_function_call()
      					|    /* sends IPI to CPU0 */
      	<IPI>				| ...
      	 __perf_event_enable()		+---------------------------
      	  ctx_resched()
      	   task_ctx_sched_out()
      	    ctx_sched_out()
      	     group_sched_out()
      	      event_sched_out()
      	       pending_disable = -1
      	</IPI>
      	<IRQ-work>
      	 perf_pending_event()
      	  perf_pending_event_disable()
      	   /* Fails to send SIGTRAP because no pending_disable! */
      	</IRQ-work>
      
      In the above case, not only is that particular SIGTRAP missed, but also
      all future SIGTRAPs because 'event_limit' is not reset back to 1.
      
To fix this, rework the pending delivery of SIGTRAP via IRQ-work by
introducing a separate 'pending_sigtrap', no longer using 'event_limit'
and 'pending_disable' for its delivery.
      
Additionally, and different from Marco's proposed patch:
      
       - recognise that pending_disable effectively duplicates oncpu for
         the case where it is set. As such, change the irq_work handler to
         use ->oncpu to target the event and use pending_* as boolean toggles.
      
       - observe that SIGTRAP targets the ctx->task, so the context switch
         optimization that carries contexts between tasks is invalid. If
         the irq_work were delayed enough to hit after a context switch the
         SIGTRAP would be delivered to the wrong task.
      
       - observe that if the event gets scheduled out
         (rotation/migration/context-switch/...) the irq-work would be
         insufficient to deliver the SIGTRAP when the event gets scheduled
         back in (the irq-work might still be pending on the old CPU).
      
   Therefore have event_sched_out() convert the pending sigtrap into a
   task_work which will deliver the signal at return-to-user; the
   hand-off is sketched below.
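
A schematic of that hand-off (field names such as pending_sigtrap and
pending_task follow the description above; refcounting and state
checks are omitted):

    static void event_sched_out(struct perf_event *event /* , ... */)
    {
            /* ... */
            if (event->pending_sigtrap) {
                    /*
                     * The irq_work may still be pending on the old CPU and can
                     * no longer deliver the signal; queue a task_work on the
                     * target task (current here, since we are scheduling it
                     * out) which runs on the return-to-user path instead.
                     */
                    event->pending_sigtrap = 0;
                    task_work_add(current, &event->pending_task, TWA_RESUME);
            }
    }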
      
      Fixes: 97ba62b2 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Marco Elver <elver@google.com>
Debugged-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
• counter: Reduce DEFINE_COUNTER_ARRAY_POLARITY() to defining counter_array · 472a1482
Committed by William Breathitt Gray
A sparse warning was reported for drivers/counter/ti-ecap-capture.c::
      
          sparse warnings: (new ones prefixed by >>)
          >> drivers/counter/ti-ecap-capture.c:380:8: sparse: sparse: symbol 'ecap_cnt_pol_array' was not declared. Should it be static?
      
          vim +/ecap_cnt_pol_array +380 drivers/counter/ti-ecap-capture.c
      
             379
           > 380	static DEFINE_COUNTER_ARRAY_POLARITY(ecap_cnt_pol_array, ecap_cnt_pol_avail, ECAP_NB_CEVT);
             381
      
      The first argument to the DEFINE_COUNTER_ARRAY_POLARITY() macro is a
      token serving as the symbol name in the definition of a new
      struct counter_array structure. However, this macro actually expands to
      two statements::
      
          #define DEFINE_COUNTER_ARRAY_POLARITY(_name, _enums, _length) \
                  DEFINE_COUNTER_AVAILABLE(_name##_available, _enums); \
                  struct counter_array _name = { \
                          .type = COUNTER_COMP_SIGNAL_POLARITY, \
                          .avail = &(_name##_available), \
                          .length = (_length), \
                  }
      
      Because of this, the "static" on line 380 only applies to the first
      statement. This patch splits out the DEFINE_COUNTER_AVAILABLE() line
      and leaves DEFINE_COUNTER_ARRAY_POLARITY() as a simple structure
      definition to avoid issues like this.
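
With that split, a driver defines the available-values table
separately and then the array, so a storage-class specifier applies to
the whole definition. For the ti-ecap-capture case above this looks
roughly like the following (argument order is assumed from the macros
quoted above):

    static DEFINE_COUNTER_AVAILABLE(ecap_cnt_pol_available, ecap_cnt_pol_avail);
    static DEFINE_COUNTER_ARRAY_POLARITY(ecap_cnt_pol_array, ecap_cnt_pol_available,
                                         ECAP_NB_CEVT);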
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/all/202210020619.NQbyomII-lkp@intel.com/
Cc: Julien Panis <jpanis@baylibre.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
• fbdev: MIPS supports iomem addresses · 25b72d53
Committed by Kees Cook
Add MIPS to the fb_* helpers list for iomem addresses. This silences Sparse
warnings about missing __iomem address space casts:
      
      drivers/video/fbdev/pvr2fb.c:800:9: sparse: sparse: incorrect type in argument 1 (different address spaces)
      drivers/video/fbdev/pvr2fb.c:800:9: sparse:     expected void const *
      drivers/video/fbdev/pvr2fb.c:800:9: sparse:     got char [noderef] __iomem *screen_base
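
The change amounts to adding MIPS to the architecture guard around the
__iomem-based accessors in <linux/fb.h>; schematically (the real guard
lists more architectures and helpers, abbreviated here as an
illustration only):

    /* Schematic only: the actual guard in <linux/fb.h> is longer. */
    #if defined(__sparc__)
    /* SBUS-specific fb_readX/fb_writeX accessors */
    #elif defined(__i386__) || defined(__x86_64__) || defined(__arm__) || \
          defined(__aarch64__) || defined(__mips__)
    #define fb_readb __raw_readb
    #define fb_readw __raw_readw
    #define fb_readl __raw_readl
    #define fb_writeb __raw_writeb
    #define fb_writew __raw_writew
    #define fb_writel __raw_writel
    #else
    /* plain pointer dereference fallback */
    #endif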
Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/lkml/202210100209.tR2Iqbqk-lkp@intel.com/
      Cc: Helge Deller <deller@gmx.de>
      Cc: linux-fbdev@vger.kernel.org
      Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Helge Deller <deller@gmx.de>
• Revert "cpumask: fix checking valid cpu range". · 80493877
Committed by Tetsuo Handa
      This reverts commit 78e5a339 ("cpumask: fix checking valid cpu range").
      
syzbot is hitting the WARN_ON_ONCE(cpu >= nr_cpumask_bits) warning at
cpu_max_bits_warn() [1] because commit 78e5a339 ("cpumask: fix checking
valid cpu range") is broken.  That patch trivially hits the
WARN_ON_ONCE() when e.g. reading /proc/cpuinfo, because passing
"cpu + 1" instead of "cpu" hits the cpu == nr_cpumask_bits condition.
      
Although syzbot found this problem in linux-next.git on 2022/09/27 [2],
it was not fixed immediately.  As a result, the patch was sent to
linux.git before its author recognized the problem, and syzbot has been
failing to test changes in linux.git since 2022/10/10 [3].
      
Andrew Jones proposed a fix for the x86 and riscv architectures [4].  But
[2] and [5] indicate that the affected locations are not limited to arch
code.  The longer it takes to find and fix the affected locations, the
less tested the kernel is (and the more difficult it is to bisect and
fix) before release.
      
      We should have inspected and fixed basically all cpumask users before
      applying that patch.  We should not crash kernels in order to ask
      existing cpumask users to update their code, even if limited to
      CONFIG_DEBUG_PER_CPU_MAPS=y case.
      
      Link: https://syzkaller.appspot.com/bug?extid=d0fd2bf0dd6da72496dd [1]
      Link: https://syzkaller.appspot.com/bug?extid=21da700f3c9f0bc40150 [2]
      Link: https://syzkaller.appspot.com/bug?extid=51a652e2d24d53e75734 [3]
      Link: https://lkml.kernel.org/r/20221014155845.1986223-1-ajones@ventanamicro.com [4]
      Link: https://syzkaller.appspot.com/bug?extid=4d46c43d81c3bd155060 [5]
Reported-by: Andrew Jones <ajones@ventanamicro.com>
Reported-by: syzbot+d0fd2bf0dd6da72496dd@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17. 16 October 2022, 1 commit
18. 15 October 2022, 1 commit