1. 29 November 2022, 1 commit
    • iomap: write iomap validity checks · d7b64041
      Authored by Dave Chinner
      A recent multithreaded write data corruption has been uncovered in
      the iomap write code. The core of the problem is that a partial folio
      write can be flushed to disk while a new racing write maps the same
      folio and fills the rest of the page:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
      When the new write finally gets to its last partial page, it does
      not find that page in the cache, so it instantiates a new page,
      sees the iomap is unwritten, and zeros the part of the page for
      which it has no data. This overwrites the data that was originally
      written to disk.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      To solve this problem, we need to check whether the iomap is still
      valid after we lock each folio during the write. We have to do it
      after we lock the folio so that we don't end up with state changes
      occurring while we wait for the folio to be locked.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
      The iomap needs to grow some storage for the validity cookie that
      the filesystem provides to travel with the iomap. XFS, in
      particular, also needs to know some more information about what the
      iomap maps (attribute extents rather than file data extents) so that
      the validity cookie can cover all the types of iomaps we might need
      to validate.
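
      A minimal sketch of the revalidation idea, assuming a filesystem-
      provided cookie carried in the cached mapping and a hypothetical
      revalidation helper (names here are illustrative, not the exact
      upstream API):

          /* illustrative only */
          struct cached_map {
                  u64 validity_cookie;    /* sampled when the map was created */
          };

          static bool cached_map_still_valid(const struct cached_map *map,
                                             u64 current_seq)
          {
                  /* e.g. the filesystem bumps a sequence number on every
                   * extent map change and compares it to the cookie */
                  return map->validity_cookie == current_seq;
          }

          static bool begin_write_revalidate(struct folio *folio,
                                             const struct cached_map *map,
                                             u64 current_seq)
          {
                  folio_lock(folio);
                  if (!cached_map_still_valid(map, current_seq)) {
                          folio_unlock(folio);
                          return false;   /* caller remaps the remaining range */
                  }
                  return true;
          }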
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  2. 23 November 2022, 1 commit
    • xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Authored by Dave Chinner
      Because that's what Christoph wants for this error-handling path
      that only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as it stands, but even though
      Christoph wants this as iomap functionality, we still have
      to call it from the filesystem specific ->iomap_end callback, and
      call into the iomap code with yet another filesystem specific
      callback to punch the delalloc extent within the defined ranges.
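
      A rough sketch of the shape this takes, with an assumed signature
      for the new iomap export and the filesystem-supplied punch callback
      (the exact upstream names and arguments may differ):

          /* assumed shape of the new export, illustrative only */
          int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
                          struct iomap *iomap, loff_t pos, loff_t length,
                          ssize_t written,
                          int (*punch)(struct inode *inode, loff_t offset,
                                       loff_t length));

          /* filesystem-specific callback invoked for each range to punch */
          static int example_punch_delalloc(struct inode *inode, loff_t offset,
                                            loff_t length)
          {
                  /* remove the delalloc extent backing [offset, offset + length) */
                  return 0;
          }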
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  3. 29 October 2022, 3 commits
  4. 28 October 2022, 1 commit
    • net/mlx5: Fix possible use-after-free in async command interface · bacd22df
      Authored by Tariq Toukan
      mlx5_cmd_cleanup_async_ctx should return only after all its callback
      handlers have completed. Before this patch, the race below between
      mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
      led to a use-after-free:
      
      1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
         elevated by 1, a single inflight callback).
      2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
      3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
         is about to call wake_up().
      4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
         immediately as the condition (num_inflight == 0) holds.
      5. mlx5_cmd_cleanup_async_ctx returns.
      6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
         object.
      7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
         object.
      
      Fix it by syncing using a completion object. Mark it completed when
      num_inflight reaches 0.
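
      A minimal sketch of that synchronization, using illustrative
      structure and function names rather than the exact driver code:

          #include <linux/atomic.h>
          #include <linux/completion.h>

          struct async_ctx {
                  atomic_t num_inflight;          /* starts at 1 */
                  struct completion inflight_done;
          };

          static void init_async_ctx(struct async_ctx *ctx)
          {
                  atomic_set(&ctx->num_inflight, 1);
                  init_completion(&ctx->inflight_done);
          }

          /* run at the end of each callback handler */
          static void cb_handler_finish(struct async_ctx *ctx)
          {
                  /* the last handler signals the waiter instead of racing a
                   * bare wait_event() on the counter */
                  if (atomic_dec_and_test(&ctx->num_inflight))
                          complete(&ctx->inflight_done);
          }

          static void cleanup_async_ctx(struct async_ctx *ctx)
          {
                  /* drop the initial reference; if callbacks are still in
                   * flight, wait until the last one calls complete() */
                  if (!atomic_dec_and_test(&ctx->num_inflight))
                          wait_for_completion(&ctx->inflight_done);
          }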
      
      Trace:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
      Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0
      
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x57/0x7d
       print_report.cold+0x2d5/0x684
       ? do_raw_spin_lock+0x23d/0x270
       kasan_report+0xb1/0x1a0
       ? do_raw_spin_lock+0x23d/0x270
       do_raw_spin_lock+0x23d/0x270
       ? rwlock_bug.part.0+0x90/0x90
       ? __delete_object+0xb8/0x100
       ? lock_downgrade+0x6e0/0x6e0
       _raw_spin_lock_irqsave+0x43/0x60
       ? __wake_up_common_lock+0xb9/0x140
       __wake_up_common_lock+0xb9/0x140
       ? __wake_up_common+0x650/0x650
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kasan_set_track+0x21/0x30
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kfree+0x1ba/0x520
       ? do_raw_spin_unlock+0x54/0x220
       mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core]
       ? dump_command+0xcc0/0xcc0 [mlx5_core]
       ? lockdep_hardirqs_on_prepare+0x400/0x400
       ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       ? irq_release+0x140/0x140 [mlx5_core]
       irq_int_handler+0x19/0x30 [mlx5_core]
       __handle_irq_event_percpu+0x1f2/0x620
       handle_irq_event+0xb2/0x1d0
       handle_edge_irq+0x21e/0xb00
       __common_interrupt+0x79/0x1a0
       common_interrupt+0x78/0xa0
       </IRQ>
       <TASK>
       asm_common_interrupt+0x22/0x40
      RIP: 0010:default_idle+0x42/0x60
      Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00
      RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242
      RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110
      RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc
      RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3
      R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005
      R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000
       ? default_idle_call+0xcc/0x450
       default_idle_call+0xec/0x450
       do_idle+0x394/0x450
       ? arch_cpu_idle_exit+0x40/0x40
       ? do_idle+0x17/0x450
       cpu_startup_entry+0x19/0x20
       start_secondary+0x221/0x2b0
       ? set_cpu_sibling_map+0x2070/0x2070
       secondary_startup_64_no_verify+0xcd/0xdb
       </TASK>
      
      Allocated by task 49502:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       kvmalloc_node+0x48/0xe0
       mlx5e_bulk_async_init+0x35/0x110 [mlx5_core]
       mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Freed by task 49502:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_set_free_info+0x20/0x30
       ____kasan_slab_free+0x11d/0x1b0
       kfree+0x1ba/0x520
       mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: e355477e ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 27 October 2022, 2 commits
    • blk-mq: don't add non-pt request with ->end_io to batch · 2d87d455
      Authored by Ming Lei
      dm-rq implements the ->end_io callback for requests issued to the
      underlying queue, and those are not passthrough requests.
      
      Commit ab3e1d3b ("block: allow end_io based requests in the completion
      batch handling") doesn't clear rq->bio and rq->__data_len for requests
      with ->end_io in blk_mq_end_request_batch(). This is actually
      dangerous, but so far it only affects nvme passthrough requests.
      
      dm-rq needs to clean up the remaining bios in case of partial
      completion, which requires req->bio to still be valid; otherwise a
      use-after-free is triggered. So the underlying clone request can't be
      completed in blk_mq_end_request_batch().
      
      Fix the panic by not adding such requests to the batch list. The issue
      can be triggered simply by exposing an nvme pci device to dm-mpath.
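
      A hedged sketch of the intended condition in the batch-add path
      (not necessarily the exact upstream hunk):

          #include <linux/blk-mq.h>

          /* A non-passthrough request carrying ->end_io (e.g. a dm-rq
           * clone) must not be completed via the batch path, which may
           * clear rq->bio and rq->__data_len before ->end_io can use
           * them. */
          static inline bool can_batch_complete(struct request *req)
          {
                  if (req->end_io && !blk_rq_is_passthrough(req))
                          return false;
                  return true;
          }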
      
      Fixes: ab3e1d3b ("block: allow end_io based requests in the completion batch handling")
      Cc: dm-devel@redhat.com
      Cc: Mike Snitzer <snitzer@kernel.org>
      Reported-by: Changhui Zhong <czhong@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221027085709.513175-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • KVM: Initialize gfn_to_pfn_cache locks in dedicated helper · 52491a38
      Authored by Michal Luczaj
      Move the gfn_to_pfn_cache lock initialization to another helper and
      call the new helper during VM/vCPU creation.  There are race
      conditions possible due to kvm_gfn_to_pfn_cache_init()'s
      ability to re-initialize the cache's locks.
      
      For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and
      kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.
      
                      (thread 1)                |           (thread 2)
                                                |
       kvm_xen_set_evtchn_fast                  |
        read_lock_irqsave(&gpc->lock, ...)      |
                                                | kvm_gfn_to_pfn_cache_init
                                                |  rwlock_init(&gpc->lock)
        read_unlock_irqrestore(&gpc->lock, ...) |
      
      Rename "cache_init" and "cache_destroy" to activate+deactivate to
      avoid implying that the cache really is destroyed/freed.
      
      Note, there are more races in the newly named kvm_gpc_activate() that
      will be addressed separately.
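
      A minimal sketch of the split, using illustrative types and names
      rather than the exact KVM API:

          struct pfn_cache {
                  rwlock_t lock;
                  struct mutex refresh_lock;
                  bool active;
          };

          /* one-time setup, called at VM/vCPU creation */
          static void pfn_cache_init(struct pfn_cache *gpc)
          {
                  rwlock_init(&gpc->lock);
                  mutex_init(&gpc->refresh_lock);
                  gpc->active = false;
          }

          /* activation only marks/refreshes the cache; it never touches
           * lock state, so a concurrent read_lock(&gpc->lock) holder can
           * no longer observe the lock being re-initialized underneath it */
          static void pfn_cache_activate(struct pfn_cache *gpc)
          {
                  gpc->active = true;
          }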
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      [sean: call out that this is a bug fix]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221013211234.1318131-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 26 October 2022, 1 commit
  7. 24 October 2022, 2 commits
  8. 22 October 2022, 2 commits
  9. 21 October 2022, 3 commits
  10. 20 October 2022, 1 commit
  11. 17 October 2022, 4 commits
    • perf: Fix missing SIGTRAPs · ca6c2132
      Authored by Peter Zijlstra
      Marco reported:
      
      Due to the implementation of how SIGTRAPs are delivered if
      perf_event_attr::sigtrap is set, we've noticed 3 issues:
      
        1. Missing SIGTRAP due to a race with event_sched_out() (more
           details below).
      
        2. Hardware PMU events being disabled due to returning 1 from
           perf_event_overflow(). The only way to re-enable the event is
           for user space to first "properly" disable the event and then
           re-enable it.
      
        3. The inability to automatically disable an event after a
           specified number of overflows via PERF_EVENT_IOC_REFRESH.
      
      The worst of the 3 issues is problem (1), which occurs when a
      pending_disable is "consumed" by a racing event_sched_out(), observed
      as follows:
      
      		CPU0			|	CPU1
      	--------------------------------+---------------------------
      	__perf_event_overflow()		|
      	 perf_event_disable_inatomic()	|
      	  pending_disable = CPU0	| ...
      					| _perf_event_enable()
      					|  event_function_call()
      					|   task_function_call()
      					|    /* sends IPI to CPU0 */
      	<IPI>				| ...
      	 __perf_event_enable()		+---------------------------
      	  ctx_resched()
      	   task_ctx_sched_out()
      	    ctx_sched_out()
      	     group_sched_out()
      	      event_sched_out()
      	       pending_disable = -1
      	</IPI>
      	<IRQ-work>
      	 perf_pending_event()
      	  perf_pending_event_disable()
      	   /* Fails to send SIGTRAP because no pending_disable! */
      	</IRQ-work>
      
      In the above case, not only is that particular SIGTRAP missed, but also
      all future SIGTRAPs because 'event_limit' is not reset back to 1.
      
      To fix, rework pending delivery of SIGTRAP via IRQ-work by introduction
      of a separate 'pending_sigtrap', no longer using 'event_limit' and
      'pending_disable' for its delivery.
      
      Additionally, and different from Marco's proposed patch:
      
       - recognise that pending_disable effectively duplicates oncpu for
         the case where it is set. As such, change the irq_work handler to
         use ->oncpu to target the event and use pending_* as boolean toggles.
      
       - observe that SIGTRAP targets the ctx->task, so the context switch
         optimization that carries contexts between tasks is invalid. If
         the irq_work were delayed enough to hit after a context switch the
         SIGTRAP would be delivered to the wrong task.
      
       - observe that if the event gets scheduled out
         (rotation/migration/context-switch/...) the irq-work would be
         insufficient to deliver the SIGTRAP when the event gets scheduled
         back in (the irq-work might still be pending on the old CPU).
      
         Therefore have event_sched_out() convert the pending sigtrap into a
         task_work which will deliver the signal at return_to_user.
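
      A rough sketch of that conversion; pending_sigtrap follows the
      description above, while the pending_task/pending_work fields and
      the helper are assumptions for illustration, not the exact upstream
      change:

          #include <linux/perf_event.h>
          #include <linux/task_work.h>

          /* called from event_sched_out() while 'current' is the ctx task */
          static void sched_out_convert_sigtrap(struct perf_event *event)
          {
                  if (!event->pending_sigtrap)
                          return;

                  event->pending_sigtrap = 0;
                  if (!event->pending_work) {
                          event->pending_work = 1;
                          /* deliver at return-to-user of the target task
                           * instead of via an irq_work that may still be
                           * pending on the old CPU */
                          task_work_add(current, &event->pending_task,
                                        TWA_RESUME);
                  }
          }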
      
      Fixes: 97ba62b2 ("perf: Add support for SIGTRAP on perf events")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Debugged-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Marco Elver <elver@google.com>
      Debugged-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Marco Elver <elver@google.com>
      Tested-by: Marco Elver <elver@google.com>
    • counter: Reduce DEFINE_COUNTER_ARRAY_POLARITY() to defining counter_array · 472a1482
      Authored by William Breathitt Gray
      A sparse warning was reported for drivers/counter/ti-ecap-capture.c::
      
          sparse warnings: (new ones prefixed by >>)
          >> drivers/counter/ti-ecap-capture.c:380:8: sparse: sparse: symbol 'ecap_cnt_pol_array' was not declared. Should it be static?
      
          vim +/ecap_cnt_pol_array +380 drivers/counter/ti-ecap-capture.c
      
             379
           > 380	static DEFINE_COUNTER_ARRAY_POLARITY(ecap_cnt_pol_array, ecap_cnt_pol_avail, ECAP_NB_CEVT);
             381
      
      The first argument to the DEFINE_COUNTER_ARRAY_POLARITY() macro is a
      token serving as the symbol name in the definition of a new
      struct counter_array structure. However, this macro actually expands to
      two statements::
      
          #define DEFINE_COUNTER_ARRAY_POLARITY(_name, _enums, _length) \
                  DEFINE_COUNTER_AVAILABLE(_name##_available, _enums); \
                  struct counter_array _name = { \
                          .type = COUNTER_COMP_SIGNAL_POLARITY, \
                          .avail = &(_name##_available), \
                          .length = (_length), \
                  }
      
      Because of this, the "static" on line 380 only applies to the first
      statement. This patch splits out the DEFINE_COUNTER_AVAILABLE() line
      and leaves DEFINE_COUNTER_ARRAY_POLARITY() as a simple structure
      definition to avoid issues like this.
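
      A hedged sketch of the resulting split (the exact upstream macro
      arguments may differ): the availability table is defined separately,
      and the array macro expands to a single definition, so a
      storage-class specifier applies cleanly:

          #define DEFINE_COUNTER_ARRAY_POLARITY(_name, _available, _length) \
                  struct counter_array _name = { \
                          .type = COUNTER_COMP_SIGNAL_POLARITY, \
                          .avail = &(_available), \
                          .length = (_length), \
                  }

          /* the driver then spells out both definitions explicitly: */
          static DEFINE_COUNTER_AVAILABLE(ecap_cnt_pol_available,
                                          ecap_cnt_pol_avail);
          static DEFINE_COUNTER_ARRAY_POLARITY(ecap_cnt_pol_array,
                                               ecap_cnt_pol_available,
                                               ECAP_NB_CEVT);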
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/all/202210020619.NQbyomII-lkp@intel.com/
      Cc: Julien Panis <jpanis@baylibre.com>
      Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
    • fbdev: MIPS supports iomem addresses · 25b72d53
      Authored by Kees Cook
      Add MIPS to fb_* helpers list for iomem addresses. This silences Sparse
      warnings about lacking __iomem address space casts:
      
      drivers/video/fbdev/pvr2fb.c:800:9: sparse: sparse: incorrect type in argument 1 (different address spaces)
      drivers/video/fbdev/pvr2fb.c:800:9: sparse:     expected void const *
      drivers/video/fbdev/pvr2fb.c:800:9: sparse:     got char [noderef] __iomem *screen_base
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/lkml/202210100209.tR2Iqbqk-lkp@intel.com/
      Cc: Helge Deller <deller@gmx.de>
      Cc: linux-fbdev@vger.kernel.org
      Cc: dri-devel@lists.freedesktop.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • Revert "cpumask: fix checking valid cpu range". · 80493877
      Authored by Tetsuo Handa
      This reverts commit 78e5a339 ("cpumask: fix checking valid cpu range").
      
      syzbot is hitting the WARN_ON_ONCE(cpu >= nr_cpumask_bits) warning at
      cpu_max_bits_warn() [1], because commit 78e5a339 ("cpumask: fix checking
      valid cpu range") is broken.  Obviously that patch hits WARN_ON_ONCE()
      when e.g. reading /proc/cpuinfo, because passing "cpu + 1" instead of
      "cpu" will trivially hit the cpu == nr_cpumask_bits condition.
      
      Although syzbot found this problem in linux-next.git on 2022/09/27 [2],
      this problem was not fixed immediately.  As a result, that patch was
      sent to linux.git before the patch author recognized this problem, and
      syzbot has been failing to test changes in linux.git since 2022/10/10
      [3].
      
      Andrew Jones proposed a fix for the x86 and riscv architectures [4].  But
      [2] and [5] indicate that the affected locations are not limited to arch
      code.  The more we delay finding and fixing the affected locations, the
      less tested the kernel (and the harder it is to bisect and fix) becomes
      before release.
      
      We should have inspected and fixed basically all cpumask users before
      applying that patch.  We should not crash kernels in order to ask
      existing cpumask users to update their code, even if limited to
      CONFIG_DEBUG_PER_CPU_MAPS=y case.
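
      An illustrative example of the legitimate pattern that tripped the
      reverted check (the iteration itself is standard cpumask usage):

          #include <linux/cpumask.h>
          #include <linux/printk.h>

          static void walk_online_cpus(void)
          {
                  unsigned int cpu = cpumask_last(cpu_online_mask);
                  unsigned int next;

                  /* Asking for the CPU after the last online one is legal
                   * and simply returns >= nr_cpu_ids ("none left").  With
                   * the reverted patch the range check was applied to
                   * cpu + 1, which can equal nr_cpumask_bits here and so
                   * fired WARN_ON_ONCE() under CONFIG_DEBUG_PER_CPU_MAPS=y. */
                  next = cpumask_next(cpu, cpu_online_mask);
                  if (next >= nr_cpu_ids)
                          pr_info("no CPU after cpu%u\n", cpu);
          }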
      
      Link: https://syzkaller.appspot.com/bug?extid=d0fd2bf0dd6da72496dd [1]
      Link: https://syzkaller.appspot.com/bug?extid=21da700f3c9f0bc40150 [2]
      Link: https://syzkaller.appspot.com/bug?extid=51a652e2d24d53e75734 [3]
      Link: https://lkml.kernel.org/r/20221014155845.1986223-1-ajones@ventanamicro.com [4]
      Link: https://syzkaller.appspot.com/bug?extid=4d46c43d81c3bd155060 [5]
      Reported-by: Andrew Jones <ajones@ventanamicro.com>
      Reported-by: syzbot+d0fd2bf0dd6da72496dd@syzkaller.appspotmail.com
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 16 October 2022, 1 commit
  13. 15 October 2022, 3 commits
  14. 14 October 2022, 2 commits
  15. 13 October 2022, 7 commits
  16. 12 October 2022, 5 commits
    • mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page · fac35ba7
      Authored by Baolin Wang
      On some architectures (like ARM64), it can support CONT-PTE/PMD size
      hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
      1G), but also CONT-PTE/PMD sizes (64K and 32M) if a 4K base page size is
      specified.
      
      So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
      use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
      hugetlb in follow_page_pte().  However this pte entry lock is incorrect
      for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
      the correct lock, which is mm->page_table_lock.
      
      That means the pte entry of the CONT-PTE size hugetlb under the current
      pte lock is unstable in follow_page_pte(), and we can continue to migrate
      or poison the pte entry of the CONT-PTE size hugetlb, which can cause some
      potential race issues even though they are under the 'pte lock'.
      
      For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
      page by the move_pages() syscall under the lock; however, another thread B
      can migrate the CONT-PTE hugetlb page at the same time, which will cause
      thread A to get an incorrect page. If thread A also wants to do page
      migration, then a data inconsistency error occurs.
      
      Moreover we have the same issue for CONT-PMD size hugetlb in
      follow_huge_pmd().
      
      To fix the above issues, rename follow_huge_pmd() to follow_huge_pmd_pte()
      to handle both PMD and PTE level size hugetlb, and use huge_pte_lock() to
      get the correct pte entry lock to make the pte entry stable.
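
      A minimal sketch of the locking difference, with an illustrative
      helper wrapped around the real huge_pte_lock() API:

          /* illustrative: take the lock that actually covers the hugetlb
           * pte entry before inspecting it */
          static pte_t lookup_hugetlb_entry(struct hstate *h,
                                            struct mm_struct *mm, pte_t *ptep)
          {
                  spinlock_t *ptl;
                  pte_t entry;

                  /* huge_pte_lock() selects the correct lock for the
                   * hugetlb size (mm->page_table_lock for the CONT-PTE
                   * case), unlike pte_offset_map_lock() which takes the
                   * per-PMD split pte lock */
                  ptl = huge_pte_lock(h, mm, ptep);
                  entry = huge_ptep_get(ptep);
                  spin_unlock(ptl);
                  return entry;
          }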
      
      Mike said:
      
      Support for CONT_PMD/_PTE was added with bb9dd3df ("arm64: hugetlb:
      refactor find_num_contig()").  Patch series "Support for contiguous pte
      hugepages", v4.  However, I do not believe these code paths were
      executed until migration support was added with 5480280d ("arm64/mm:
      enable HugeTLB migration for contiguous bit HugeTLB pages").  I would go
      with 5480280d for the Fixes: target.
      
      Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
      Fixes: 5480280d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype · 6a961bff
      Authored by Tiezhu Yang
      The argument has_signal of arch_do_signal_or_restart() was removed in
      commit 8ba62d37 ("task_work: Call tracehook_notify_signal from
      get_signal on all architectures"), so remove the related comment.
      
      Link: https://lkml.kernel.org/r/1662090106-5545-1-git-send-email-yangtiezhu@loongson.cn
      Fixes: 8ba62d37 ("task_work: Call tracehook_notify_signal from get_signal on all architectures")
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • prandom: remove unused functions · de492c83
      Authored by Jason A. Donenfeld
      With no callers left of prandom_u32() and prandom_bytes(), as well as
      get_random_int(), remove these deprecated wrappers, in favor of
      get_random_u32() and get_random_bytes().
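
      For illustration, the replacement is a direct one (hypothetical
      caller, assuming len > 0):

          #include <linux/random.h>

          static void example_fill(u8 *buf, size_t len)
          {
                  u32 v = get_random_u32();   /* was prandom_u32()/get_random_int() */

                  get_random_bytes(buf, len); /* was prandom_bytes(buf, len) */
                  buf[0] ^= v;                /* just to use v; illustrative */
          }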
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Authored by Jason A. Donenfeld
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
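
      As a concrete illustration of what the script does to a typical
      caller (hypothetical function, not taken from the patch itself):

          #include <linux/prandom.h>

          static u32 pick_slot(u32 nr_slots)
          {
                  /* before: prandom_u32() % nr_slots, which costs a division
                   * and draws a full 32-bit word regardless of the range */
                  return prandom_u32_max(nr_slots);
          }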
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: KP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • cgroup: add cgroup_v1v2_get_from_[fd/file]() · a6d1ce59
      Authored by Yosry Ahmed
      Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that
      support both cgroup1 and cgroup2.
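
      A brief usage sketch, assuming the new helpers return a cgroup
      pointer or an ERR_PTR like the existing v2-only variants:

          #include <linux/cgroup.h>
          #include <linux/err.h>

          static int use_cgroup_fd(int fd)
          {
                  struct cgroup *cgrp;

                  /* works for both cgroup1 and cgroup2 hierarchies */
                  cgrp = cgroup_v1v2_get_from_fd(fd);
                  if (IS_ERR(cgrp))
                          return PTR_ERR(cgrp);

                  /* ... use cgrp ... */

                  cgroup_put(cgrp);
                  return 0;
          }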
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  17. 10 October 2022, 1 commit