1. 25 September 2022 (2 commits)
    • LoongArch: Use acpi_arch_dma_setup() and remove ARCH_HAS_PHYS_TO_DMA · c78c43fe
      Committed by Jianmin Lv
      Use the _DMA object defined in the ACPI spec for translation between
      DMA addresses and CPU addresses, and implement acpi_arch_dma_setup()
      for initializing dev->dma_range_map, where acpi_dma_get_range()
      is called to parse _DMA.
      
      For example, if we have two DMA ranges:
      cpu address      dma address    size         offset
      0x200080000000   0x2080000000   0x400000000  0x1fe000000000
      0x400080000000   0x4080000000   0x400000000  0x3fc000000000
      
      _DMA for PCI devices should be declared in the host bridge as
      follows:
      
      Name (_DMA, ResourceTemplate() {
              QWordMemory (ResourceProducer,
                  PosDecode,
                  MinFixed,
                  MaxFixed,
                  NonCacheable,
                  ReadWrite,
                  0x0,
                  0x4080000000,
                  0x447fffffff,
                  0x3fc000000000,
                  0x400000000,
                  ,
                  ,
                  )
      
              QWordMemory (ResourceProducer,
                  PosDecode,
                  MinFixed,
                  MaxFixed,
                  NonCacheable,
                  ReadWrite,
                  0x0,
                  0x2080000000,
                  0x247fffffff,
                  0x1fe000000000,
                  0x400000000,
                  ,
                  ,
                  )
          })
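
      For illustration only: the two ranges above would land in
      dev->dma_range_map as a table along these lines (a hedged sketch in C,
      assuming the struct bus_dma_region layout of that era, where
      offset = cpu_start - dma_start; the array name is made up here and the
      table ends with an empty entry):

      #include <linux/dma-direct.h>

      static const struct bus_dma_region loongson_pci_dma_ranges[] = {
              {
                      .cpu_start = 0x200080000000,
                      .dma_start = 0x2080000000,
                      .size      = 0x400000000,
                      .offset    = 0x1fe000000000,
              },
              {
                      .cpu_start = 0x400080000000,
                      .dma_start = 0x4080000000,
                      .size      = 0x400000000,
                      .offset    = 0x3fc000000000,
              },
              { 0 }   /* sentinel */
      };
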
      Acked-by: Huacai Chen <chenhuacai@loongson.cn>
      Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • ACPI: scan: Support multiple DMA windows with different offsets · bf2ee8d0
      Committed by Jianmin Lv
      In DT system configurations, of_dma_get_range() returns struct
      bus_dma_region DMA regions; they are used to set up device
      DMA windows with different offsets available for translation between
      DMA addresses and CPU addresses.
      
      In ACPI system configurations, acpi_dma_get_range() does not return
      DMA regions yet and that precludes setting up the dev->dma_range_map
      pointer and therefore DMA regions with multiple offsets.
      
      Update acpi_dma_get_range() to return struct bus_dma_region
      DMA regions like of_dma_get_range() does.
      
      After updating acpi_dma_get_range(), acpi_arch_dma_setup() is changed
      for ARM64, where the original dma_addr and size arguments are removed
      as they are now redundant, and 0 and U64_MAX are passed as the
      dma_base and size arguments of arch_setup_dma_ops(); this is a
      simplification consistent with what other ACPI architectures also
      pass to iommu_setup_dma_ops().
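
      Conceptually, once dev->dma_range_map holds these regions, translating
      a CPU address to a DMA address is a walk over the map (a simplified
      sketch, not the exact kernel helper; the function name is illustrative):

      static u64 map_phys_to_dma(const struct bus_dma_region *map, u64 paddr)
      {
              const struct bus_dma_region *m;

              for (m = map; m->size; m++) {
                      if (paddr >= m->cpu_start &&
                          paddr - m->cpu_start < m->size)
                              return paddr - m->offset; /* dma = cpu - offset */
              }
              return ~0ULL;   /* no DMA window covers this address */
      }
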
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
      Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  2. 13 September 2022 (1 commit)
  3. 11 September 2022 (1 commit)
  4. 08 September 2022 (1 commit)
    • fs: only do a memory barrier for the first set_buffer_uptodate() · 2f79cdfe
      Committed by Linus Torvalds
      Commit d4252071 ("add barriers to buffer_uptodate and
      set_buffer_uptodate") added proper memory barriers to the buffer head
      BH_Uptodate bit, so that anybody who tests a buffer for being up-to-date
      will be guaranteed to actually see initialized state.
      
      However, that commit didn't _just_ add the memory barrier, it also ended
      up dropping the "was it already set" logic that the BUFFER_FNS() macro
      had.
      
      That's conceptually the right thing for a generic "this is a memory
      barrier" operation, but in the case of the buffer contents, we really
      only care about the memory barrier for the _first_ time we set the bit,
      in that the only memory ordering protection we need is to avoid anybody
      seeing uninitialized memory contents.
      
      Any other access ordering wouldn't be about the BH_Uptodate bit anyway,
      and would require some other proper lock (typically BH_Lock or the folio
      lock).  A reader that races with somebody invalidating the buffer head
      isn't an issue wrt the memory ordering, it's a serialization issue.
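
      The restored logic looks roughly like this (a sketch of the idea rather
      than the literal patch):

      static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
      {
              /*
               * If somebody else already set this uptodate, they will have
               * done the memory barrier, and a reader will see *some* valid
               * buffer state regardless.
               */
              if (test_bit(BH_Uptodate, &bh->b_state))
                      return;

              /* Pairs with the load-acquire in buffer_uptodate() */
              smp_mb__before_atomic();
              set_bit(BH_Uptodate, &bh->b_state);
      }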
      
      Now, you'd think that the buffer head operations don't matter in this
      day and age (and I certainly thought so), but apparently some loads
      still end up being heavy users of buffer heads.  In particular, the
      kernel test robot reported that not having this bit access optimization
      in place caused a noticeable direct IO performance regression on ext4:
      
        fxmark.ssd_ext4_no_jnl_DWTL_54_directio.works/sec -26.5% regression
      
      although you presumably need a fast disk and a lot of cores to actually
      notice.
      
      Link: https://lore.kernel.org/all/Yw8L7HTZ%2FdE2%2Fo9C@xsang-OptiPlex-9020/
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Fengwei Yin <fengwei.yin@intel.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 September 2022 (1 commit)
  6. 06 September 2022 (1 commit)
  7. 05 September 2022 (2 commits)
  8. 03 September 2022 (1 commit)
  9. 02 September 2022 (2 commits)
    • spi: mux: Fix mux interaction with fast path optimisations · b30f7c8e
      Committed by Mark Brown
      The spi-mux driver is rather too clever and attempts to resubmit any
      message that is submitted to it to the parent controller with some
      adjusted callbacks.  This does not play at all nicely with the fast
      path, which now sets flags on the message indicating that it's being
      handled through the fast path; we see async messages flagged as being on
      the fast path.  Ideally the spi-mux code would duplicate the message, but
      that's rather invasive and a bit fragile in that it relies on the mux
      knowing which fields in the message to copy.  Instead teach the core
      that there are controllers which can't cope with the fast path and have
      the mux flag itself as being such a controller, ensuring that messages
      going via the mux don't get partially handled via the fast path.
      
      This will reduce the performance of any spi-mux connected device since
      we'll now always use the thread for both the actual controller and the
      mux controller instead of just the actual controller but given that we
      were always hitting the slow path anyway it's hopefully not too much of
      an additional cost and it allows us to keep the fast path.
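
      In outline, the fix amounts to a per-controller opt-out that the sync
      path checks before taking the fast path (a hedged sketch; the field name
      "must_async" is an assumption here, not a confirmed API):

      /* spi core: only use the spi_sync fast path when allowed */
      static bool spi_can_use_fast_path(struct spi_controller *ctlr)
      {
              return !ctlr->must_async;       /* spi-mux sets must_async */
      }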
      
      Fixes: ae7d2346 ("spi: Don't use the message queue if possible in spi_sync")
      Reported-by: Casper Andersson <casper.casan@gmail.com>
      Tested-by: Casper Andersson <casper.casan@gmail.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Link: https://lore.kernel.org/r/20220901120732.49245-1-broonie@kernel.org
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • tcp: TX zerocopy should not sense pfmemalloc status · 32614006
      Committed by Eric Dumazet
      We got a recent syzbot report [1] showing a possible misuse
      of pfmemalloc page status in TCP zerocopy paths.
      
      Indeed, for pages coming from user space or other layers,
      using page_is_pfmemalloc() is moot, and possibly could give
      false positives.
      
      There have been attempts to make page_is_pfmemalloc() more robust,
      but not using it in the first place in this context is probably better,
      saving CPU cycles as well.
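
      A sketch of how the zerocopy call sites change, assuming the helper
      signature from the prerequisite commit 84ce071e:

      /* Before: propagates the page's pfmemalloc bit into the skb. */
      skb_fill_page_desc(skb, i, page, offset, size);

      /* After: fills the fragment without sensing pfmemalloc, which is
       * meaningless for pages that came from user space. */
      __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, offset, size);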
      
      Note to stable teams:
      
      You need to backport 84ce071e ("net: introduce
      __skb_fill_page_desc_noacc") as a prereq.
      
      Race is more probable after commit c07aea3e
      ("mm: add a signature in struct page") because page_is_pfmemalloc()
      is now using low order bit from page->lru.next, which can change
      more often than page->index.
      
      Low order bit should never be set for lru.next (when used as an anchor
      in LRU list), so KCSAN report is mostly a false positive.
      
      Backporting to older kernel versions seems not necessary.
      
      [1]
      BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag
      
      write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0:
      __list_add include/linux/list.h:73 [inline]
      list_add include/linux/list.h:88 [inline]
      lruvec_add_folio include/linux/mm_inline.h:105 [inline]
      lru_add_fn+0x440/0x520 mm/swap.c:228
      folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
      folio_batch_add_and_move mm/swap.c:263 [inline]
      folio_add_lru+0xf1/0x140 mm/swap.c:490
      filemap_add_folio+0xf8/0x150 mm/filemap.c:948
      __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981
      pagecache_get_page+0x26/0x190 mm/folio-compat.c:104
      grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116
      ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988
      generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738
      ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270
      ext4_file_write_iter+0x2e3/0x1210
      call_write_iter include/linux/fs.h:2187 [inline]
      new_sync_write fs/read_write.c:491 [inline]
      vfs_write+0x468/0x760 fs/read_write.c:578
      ksys_write+0xe8/0x1a0 fs/read_write.c:631
      __do_sys_write fs/read_write.c:643 [inline]
      __se_sys_write fs/read_write.c:640 [inline]
      __x64_sys_write+0x3e/0x50 fs/read_write.c:640
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1:
      page_is_pfmemalloc include/linux/mm.h:1740 [inline]
      __skb_fill_page_desc include/linux/skbuff.h:2422 [inline]
      skb_fill_page_desc include/linux/skbuff.h:2443 [inline]
      tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018
      do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075
      tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline]
      tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150
      inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
      kernel_sendpage+0x184/0x300 net/socket.c:3561
      sock_sendpage+0x5a/0x70 net/socket.c:1054
      pipe_to_sendpage+0x128/0x160 fs/splice.c:361
      splice_from_pipe_feed fs/splice.c:415 [inline]
      __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
      splice_from_pipe fs/splice.c:594 [inline]
      generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
      do_splice_from fs/splice.c:764 [inline]
      direct_splice_actor+0x80/0xa0 fs/splice.c:931
      splice_direct_to_actor+0x305/0x620 fs/splice.c:886
      do_splice_direct+0xfb/0x180 fs/splice.c:974
      do_sendfile+0x3bf/0x910 fs/read_write.c:1249
      __do_sys_sendfile64 fs/read_write.c:1317 [inline]
      __se_sys_sendfile64 fs/read_write.c:1303 [inline]
      __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x0000000000000000 -> 0xffffea0004a1d288
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
      
      Fixes: c07aea3e ("mm: add a signature in struct page")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 01 September 2022 (2 commits)
    • rxrpc: Fix ICMP/ICMP6 error handling · ac56a0b4
      Committed by David Howells
      Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing
      it to siphon off UDP packets early in the handling of received UDP packets,
      thereby avoiding the packets going through the UDP receive queue, it doesn't
      get ICMP packets through the UDP ->sk_error_report() callback.  In fact, it
      doesn't appear that there's any usable option for getting hold of ICMP
      packets.
      
      Fix this by adding a new UDP encap hook to distribute error messages for
      UDP tunnels.  If the hook is set, then the tunnel driver will be able to
      see ICMP packets.  The hook provides the offset into the packet of the UDP
      header of the original packet that caused the notification.
      
      An alternative would be to call the ->error_handler() hook - but that
      requires that the skbuff be cloned (as ip_icmp_error() or ipv6_icmp_error()
      do, though that isn't really necessary or desirable in rxrpc's case, as we
      want to parse the packets there and then, not queue them).
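
      The shape of the new hook (a sketch; the exact placement and signature
      are assumptions based on the description above):

      /* UDP encap: distribute ICMP/ICMP6 errors to the tunnel driver;
       * udp_offset locates the original packet's UDP header in skb. */
      void (*encap_err_rcv)(struct sock *sk, struct sk_buff *skb,
                            unsigned int udp_offset);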
      
      Changes
      =======
      ver #3)
       - Fixed an uninitialised variable.
      
      ver #2)
       - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals.
      
      Fixes: 5271953c ("rxrpc: Use the UDP encap_rcv hook")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse · 2555283e
      Committed by Jann Horn
      anon_vma->degree tracks the combined number of child anon_vmas and VMAs
      that use the anon_vma as their ->anon_vma.
      
      anon_vma_clone() then assumes that for any anon_vma attached to
      src->anon_vma_chain other than src->anon_vma, it is impossible for it to
      be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
      elevated by 1 because of a child anon_vma, and therefore that if ->degree
      equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
      
      This assumption is wrong because the ->degree optimization leads to leaf
      nodes being abandoned on anon_vma_clone() - an existing anon_vma is
      reused and no new parent-child relationship is created.  So it is
      possible to reuse an anon_vma for one VMA while it is still tied to
      another VMA.
      
      This is an issue because is_mergeable_anon_vma() and its callers assume
      that if two VMAs have the same ->anon_vma, the list of anon_vmas
      attached to the VMAs is guaranteed to be the same.  When this assumption
      is violated, vma_merge() can merge pages into a VMA that is not attached
      to the corresponding anon_vma, leading to dangling page->mapping
      pointers that will be dereferenced during rmap walks.
      
      Fix it by separately tracking the number of child anon_vmas and the
      number of VMAs using the anon_vma as their ->anon_vma.
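
      In outline, the single counter is split in struct anon_vma (a sketch;
      the field names follow the description above):

      struct anon_vma {
              /* ... */
              unsigned long num_children;     /* child anon_vmas */
              unsigned long num_active_vmas;  /* VMAs with ->anon_vma == this */
              /* ... */
      };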
      
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Cc: stable@kernel.org
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 31 August 2022 (1 commit)
  12. 30 August 2022 (3 commits)
    • USB: core: Prevent nested device-reset calls · 9c6d7788
      Committed by Alan Stern
      Automatic kernel fuzzing revealed a recursive locking violation in
      usb-storage:
      
      ============================================
      WARNING: possible recursive locking detected
      5.18.0 #3 Not tainted
      --------------------------------------------
      kworker/1:3/1205 is trying to acquire lock:
      ffff888018638db8 (&us_interface_key[i]){+.+.}-{3:3}, at:
      usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230
      
      but task is already holding lock:
      ffff888018638db8 (&us_interface_key[i]){+.+.}-{3:3}, at:
      usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230
      
      ...
      
      stack backtrace:
      CPU: 1 PID: 1205 Comm: kworker/1:3 Not tainted 5.18.0 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Workqueue: usb_hub_wq hub_event
      Call Trace:
      <TASK>
      __dump_stack lib/dump_stack.c:88 [inline]
      dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
      print_deadlock_bug kernel/locking/lockdep.c:2988 [inline]
      check_deadlock kernel/locking/lockdep.c:3031 [inline]
      validate_chain kernel/locking/lockdep.c:3816 [inline]
      __lock_acquire.cold+0x152/0x3ca kernel/locking/lockdep.c:5053
      lock_acquire kernel/locking/lockdep.c:5665 [inline]
      lock_acquire+0x1ab/0x520 kernel/locking/lockdep.c:5630
      __mutex_lock_common kernel/locking/mutex.c:603 [inline]
      __mutex_lock+0x14f/0x1610 kernel/locking/mutex.c:747
      usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230
      usb_reset_device+0x37d/0x9a0 drivers/usb/core/hub.c:6109
      r871xu_dev_remove+0x21a/0x270 drivers/staging/rtl8712/usb_intf.c:622
      usb_unbind_interface+0x1bd/0x890 drivers/usb/core/driver.c:458
      device_remove drivers/base/dd.c:545 [inline]
      device_remove+0x11f/0x170 drivers/base/dd.c:537
      __device_release_driver drivers/base/dd.c:1222 [inline]
      device_release_driver_internal+0x1a7/0x2f0 drivers/base/dd.c:1248
      usb_driver_release_interface+0x102/0x180 drivers/usb/core/driver.c:627
      usb_forced_unbind_intf+0x4d/0xa0 drivers/usb/core/driver.c:1118
      usb_reset_device+0x39b/0x9a0 drivers/usb/core/hub.c:6114
      
      This turned out not to be an error in usb-storage but rather a nested
      device reset attempt.  That is, as the rtl8712 driver was being
      unbound from a composite device in preparation for an unrelated USB
      reset (that driver does not have pre_reset or post_reset callbacks),
      its ->remove routine called usb_reset_device() -- thus nesting one
      reset call within another.
      
      Performing a reset as part of disconnect processing is a questionable
      practice at best.  However, the bug report points out that the USB
      core does not have any protection against nested resets.  Adding a
      reset_in_progress flag and testing it will prevent such errors in the
      future.
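
      Roughly (a sketch of the flag and its use; details assumed):

      /* In struct usb_device: */
      unsigned reset_in_progress:1;

      /* In usb_reset_device(), refuse to nest: */
      if (udev->reset_in_progress)
              return -EINPROGRESS;
      udev->reset_in_progress = 1;
      /* ... unbind/rebind interfaces and perform the reset ... */
      udev->reset_in_progress = 0;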
      
      Link: https://lore.kernel.org/all/CAB7eexKUpvX-JNiLzhXBDWgfg2T9e9_0Tw4HQ6keN==voRbP0g@mail.gmail.com/
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: Rondreis <linhaoguo86@gmail.com>
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Link: https://lore.kernel.org/r/YwkflDxvg0KWqyZK@rowland.harvard.edu
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: 9229/1: amba: Fix use-after-free in amba_read_periphid() · 25af7406
      Committed by Isaac Manjarres
      After commit f2d3b9a4 ("ARM: 9220/1: amba: Remove deferred device
      addition"), it became possible for amba_read_periphid() to be invoked
      concurrently from two threads for a particular AMBA device.
      
      Consider the case where a thread (T0) is registering an AMBA driver, and
      searching for all of the devices it can match with on the AMBA bus.
      Suppose that another thread (T1) is executing the deferred probe work,
      and is searching through all of the AMBA drivers on the bus for a driver
      that matches a particular AMBA device. Assume that both threads begin
      operating on the same AMBA device and the device's peripheral ID is
      still unknown.
      
      In this scenario, the amba_match() function will be invoked for the
      same AMBA device by both threads, which means amba_read_periphid()
      can also be invoked by both threads, and both threads will be able
      to manipulate the AMBA device's pclk pointer without any synchronization.
      It's possible that one thread will initialize the pclk pointer, then the
      other thread will re-initialize it, overwriting the previous value, and
      both will race to free the same pclk, resulting in a use-after-free for
      whichever thread frees the pclk last.
      
      Add a lock per AMBA device to synchronize the handling of detecting the
      peripheral ID, avoiding the use-after-free scenario.
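
      Conceptually (a hedged sketch; the lock name and helper are
      illustrative, not the literal patch):

      /* In struct amba_device: serializes peripheral ID detection */
      struct mutex periphid_lock;

      /* amba_read_periphid() handles pclk only under the lock: */
      mutex_lock(&dev->periphid_lock);
      if (!dev->periphid)
              ret = read_periphid(dev);  /* gets, uses and puts pclk safely */
      mutex_unlock(&dev->periphid_lock);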
      
      The following KFENCE bug report helped detect this problem:
      ==================================================================
      BUG: KFENCE: use-after-free read in clk_disable+0x14/0x34
      
      Use-after-free read at 0x(ptrval) (in kfence-#19):
       clk_disable+0x14/0x34
       amba_read_periphid+0xdc/0x134
       amba_match+0x3c/0x84
       __driver_attach+0x20/0x158
       bus_for_each_dev+0x74/0xc0
       bus_add_driver+0x154/0x1e8
       driver_register+0x88/0x11c
       do_one_initcall+0x8c/0x2fc
       kernel_init_freeable+0x190/0x220
       kernel_init+0x10/0x108
       ret_from_fork+0x14/0x3c
       0x0
      
      kfence-#19: 0x(ptrval)-0x(ptrval), size=36, cache=kmalloc-64
      
      allocated by task 8 on cpu 0 at 11.629931s:
       clk_hw_create_clk+0x38/0x134
       amba_get_enable_pclk+0x10/0x68
       amba_read_periphid+0x28/0x134
       amba_match+0x3c/0x84
       __device_attach_driver+0x2c/0xc4
       bus_for_each_drv+0x80/0xd0
       __device_attach+0xb0/0x1f0
       bus_probe_device+0x88/0x90
       deferred_probe_work_func+0x8c/0xc0
       process_one_work+0x23c/0x690
       worker_thread+0x34/0x488
       kthread+0xd4/0xfc
       ret_from_fork+0x14/0x3c
       0x0
      
      freed by task 8 on cpu 0 at 11.630095s:
       amba_read_periphid+0xec/0x134
       amba_match+0x3c/0x84
       __device_attach_driver+0x2c/0xc4
       bus_for_each_drv+0x80/0xd0
       __device_attach+0xb0/0x1f0
       bus_probe_device+0x88/0x90
       deferred_probe_work_func+0x8c/0xc0
       process_one_work+0x23c/0x690
       worker_thread+0x34/0x488
       kthread+0xd4/0xfc
       ret_from_fork+0x14/0x3c
       0x0
      
      Cc: Saravana Kannan <saravanak@google.com>
      Cc: patches@armlinux.org.uk
      Fixes: f2d3b9a4 ("ARM: 9220/1: amba: Remove deferred device addition")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
    • tracing: Define the is_signed_type() macro once · dcf8e563
      Committed by Bart Van Assche
      There are two definitions of the is_signed_type() macro: one in
      <linux/overflow.h> and a second definition in <linux/trace_events.h>.
      
      As suggested by Linus, move the definition of the is_signed_type() macro
      into the <linux/compiler.h> header file.  Change the definition of the
      is_signed_type() macro to make sure that it does not trigger any sparse
      warnings with future versions of sparse for bitwise types.
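
      The resulting single definition in <linux/compiler.h> (the __force cast
      is what keeps sparse quiet for bitwise types):

      /*
       * Whether 'type' is a signed type or an unsigned type.  Supports
       * scalar types, bool and also pointer types.
       */
      #define is_signed_type(type) (((type)(-1)) < (__force type)1)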
      
      Link: https://lore.kernel.org/all/CAHk-=whjH6p+qzwUdx5SOVVHjS3WvzJQr6mDUwhEyTf6pJWzaQ@mail.gmail.com/
      Link: https://lore.kernel.org/all/CAHk-=wjQGnVfb4jehFR0XyZikdQvCZouE96xR_nnf5kqaM5qqQ@mail.gmail.com/
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 29 August 2022 (2 commits)
  14. 27 August 2022 (1 commit)
  15. 26 August 2022 (2 commits)
  16. 24 August 2022 (3 commits)
  17. 23 August 2022 (2 commits)
    • Revert "driver core: Delete driver_deferred_probe_check_state()" · 13a8e0f6
      Committed by Saravana Kannan
      This reverts commit 9cbffc7a.
      
      There are a few more issues to fix that have been reported in the thread
      for the original series [1]. We'll need to fix those before this will work.
      So, revert it for now.
      
      [1] - https://lore.kernel.org/lkml/20220601070707.3946847-1-saravanak@google.com/
      
      Fixes: 9cbffc7a ("driver core: Delete driver_deferred_probe_check_state()")
      Tested-by: Tony Lindgren <tony@atomide.com>
      Tested-by: Peng Fan <peng.fan@nxp.com>
      Tested-by: Douglas Anderson <dianders@chromium.org>
      Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Reviewed-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Saravana Kannan <saravanak@google.com>
      Link: https://lore.kernel.org/r/20220819221616.2107893-2-saravanak@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Avoid false positive lockdep warning by adding lock_class_key · d59b73a6
      Committed by Moshe Shemesh
      Add a lock_class_key per mlx5 device to avoid a false positive
      "possible circular locking dependency" warning from lockdep on flows
      which lock more than one mlx5 device, such as adding an SF.
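
      In outline (a sketch; field placement in struct mlx5_core_dev assumed):

      /* one lockdep class per device instance */
      struct lock_class_key lock_key;

      /* at device init: give this device's mutex its own class, so the
       * parent PF and its SF devices no longer share one lockdep class */
      mutex_init(&dev->intf_state_mutex);
      lockdep_register_key(&dev->lock_key);
      lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);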
      
      kernel log:
       ======================================================
       WARNING: possible circular locking dependency detected
       5.19.0-rc8+ #2 Not tainted
       ------------------------------------------------------
       kworker/u20:0/8 is trying to acquire lock:
       ffff88812dfe0d98 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_init_one+0x2e/0x490 [mlx5_core]
      
       but task is already holding lock:
       ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&notifier->n_head)->rwsem){++++}-{3:3}:
              down_write+0x90/0x150
              blocking_notifier_chain_register+0x53/0xa0
              mlx5_sf_table_init+0x369/0x4a0 [mlx5_core]
              mlx5_init_one+0x261/0x490 [mlx5_core]
              probe_one+0x430/0x680 [mlx5_core]
              local_pci_probe+0xd6/0x170
              work_for_cpu_fn+0x4e/0xa0
              process_one_work+0x7c2/0x1340
              worker_thread+0x6f6/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
       -> #0 (&dev->intf_state_mutex){+.+.}-{3:3}:
              __lock_acquire+0x2fc7/0x6720
              lock_acquire+0x1c1/0x550
              __mutex_lock+0x12c/0x14b0
              mlx5_init_one+0x2e/0x490 [mlx5_core]
              mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
              auxiliary_bus_probe+0x9d/0xe0
              really_probe+0x1e0/0xaa0
              __driver_probe_device+0x219/0x480
              driver_probe_device+0x49/0x130
              __device_attach_driver+0x1b8/0x280
              bus_for_each_drv+0x123/0x1a0
              __device_attach+0x1a3/0x460
              bus_probe_device+0x1a2/0x260
              device_add+0x9b1/0x1b40
              __auxiliary_device_add+0x88/0xc0
              mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
              blocking_notifier_call_chain+0xd5/0x130
              mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
              process_one_work+0x7c2/0x1340
              worker_thread+0x59d/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&notifier->n_head)->rwsem);
                                      lock(&dev->intf_state_mutex);
                                      lock(&(&notifier->n_head)->rwsem);
         lock(&dev->intf_state_mutex);
      
        *** DEADLOCK ***
      
       4 locks held by kworker/u20:0/8:
        #0: ffff888150612938 ((wq_completion)mlx5_events){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340
        #1: ffff888100cafdb8 ((work_completion)(&work->work)#3){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340
        #2: ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
        #3: ffff88813682d0e8 (&dev->mutex){....}-{3:3}, at:__device_attach+0x76/0x460
      
       stack backtrace:
       CPU: 6 PID: 8 Comm: kworker/u20:0 Not tainted 5.19.0-rc8+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x57/0x7d
        check_noncircular+0x278/0x300
        ? print_circular_bug+0x460/0x460
        ? lock_chain_count+0x20/0x20
        ? register_lock_class+0x1880/0x1880
        __lock_acquire+0x2fc7/0x6720
        ? register_lock_class+0x1880/0x1880
        ? register_lock_class+0x1880/0x1880
        lock_acquire+0x1c1/0x550
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        __mutex_lock+0x12c/0x14b0
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? _raw_read_unlock+0x1f/0x30
        ? mutex_lock_io_nested+0x1320/0x1320
        ? __ioremap_caller.constprop.0+0x306/0x490
        ? mlx5_sf_dev_probe+0x269/0x370 [mlx5_core]
        ? iounmap+0x160/0x160
        mlx5_init_one+0x2e/0x490 [mlx5_core]
        mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
        ? mlx5_sf_dev_remove+0x130/0x130 [mlx5_core]
        auxiliary_bus_probe+0x9d/0xe0
        really_probe+0x1e0/0xaa0
        __driver_probe_device+0x219/0x480
        ? auxiliary_match_id+0xe9/0x140
        driver_probe_device+0x49/0x130
        __device_attach_driver+0x1b8/0x280
        ? driver_allows_async_probing+0x140/0x140
        bus_for_each_drv+0x123/0x1a0
        ? bus_for_each_dev+0x1a0/0x1a0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        ? trace_hardirqs_on+0x2d/0x100
        __device_attach+0x1a3/0x460
        ? device_driver_attach+0x1e0/0x1e0
        ? kobject_uevent_env+0x22d/0xf10
        bus_probe_device+0x1a2/0x260
        device_add+0x9b1/0x1b40
        ? dev_set_name+0xab/0xe0
        ? __fw_devlink_link_to_suppliers+0x260/0x260
        ? memset+0x20/0x40
        ? lockdep_init_map_type+0x21a/0x7d0
        __auxiliary_device_add+0x88/0xc0
        ? auxiliary_device_init+0x86/0xa0
        mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
        blocking_notifier_call_chain+0xd5/0x130
        mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
        ? mlx5_vhca_event_arm+0x100/0x100 [mlx5_core]
        ? lock_downgrade+0x6e0/0x6e0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        process_one_work+0x7c2/0x1340
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        ? pwq_dec_nr_in_flight+0x230/0x230
        ? rwlock_bug.part.0+0x90/0x90
        worker_thread+0x59d/0xec0
        ? process_one_work+0x1340/0x1340
        kthread+0x28f/0x330
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
      Fixes: 6a327321 ("net/mlx5: SF, Port function state change support")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
  18. 21 August 2022 (5 commits)
    • mm/shmem: fix chattr fsflags support in tmpfs · cb241339
      Committed by Hugh Dickins
      ext[234] have always allowed unimplemented chattr flags to be set, but
      other filesystems have tended to be stricter.  Follow the stricter
      approach for tmpfs: I don't want to have to explain why the 'c', 's'
      and 'u' attributes don't actually work, and we won't need to update the
      chattr(1) manpage; and it's never wrong to start off strict, relaxing
      later if persuaded.  Allow only 'a' (append only), 'i' (immutable),
      'A' (no atime) and 'd' (no dump).
      
      Although lsattr showed 'A' inherited, the NOATIME behavior was not being
      inherited, because nothing sync'ed FS_NOATIME_FL to S_NOATIME.  Add
      shmem_set_inode_flags() to sync the flags, using inode_set_flags() to
      avoid that instant of lost immutability during fileattr_set().
      
      But that change switched generic/079 from passing to failing: because
      FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the
      INHERITED fsflags: remove them and generic/079 is back to passing.
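
      The syncing helper is essentially a translation table from FS_*_FL to
      S_* flags (a sketch along the lines of the patch):

      static void shmem_set_inode_flags(struct inode *inode,
                                        unsigned int fsflags)
      {
              unsigned int i_flags = 0;

              if (fsflags & FS_NOATIME_FL)
                      i_flags |= S_NOATIME;
              if (fsflags & FS_APPEND_FL)
                      i_flags |= S_APPEND;
              if (fsflags & FS_IMMUTABLE_FL)
                      i_flags |= S_IMMUTABLE;
              /* set all three atomically, so immutability is never dropped */
              inode_set_flags(inode, i_flags,
                              S_NOATIME | S_APPEND | S_IMMUTABLE);
      }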
      
      Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com
      Fixes: e408e695 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/uffd: reset write protection when unregister with wp-mode · f369b07c
      Committed by Peter Xu
      The motivation of this patch comes from a recent report and patchfix from
      David Hildenbrand on hugetlb shared handling of wr-protected page [1].
      
      With the reproducer provided in commit message of [1], one can leverage
      the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
      not only the attacker process, but also the whole system.
      
      The lazy-reset mechanism of uffd-wp was used to make unregister faster,
      under the assumption that any leftover pgtable entries should only
      affect the process itself, so that not only should the user be aware of
      anything it does, but also nothing outside of the process should be
      affected.
      
      But it seems that this is not true, and it can also be utilized to make
      some exploit easier.
      
      So far there's no clue showing that the lazy-reset is important to any
      userfaultfd users because normally the unregister will only happen once
      for a specific range of memory of the lifecycle of the process.
      
      Considering all above, what this patch proposes is to do explicit pte
      resets when unregister an uffd region with wr-protect mode enabled.
      
      It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
      right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
      make the unregister slower.  From that pov it's a very slight abi change,
      but hopefully nothing should break with this change either.
      
      Regarding to the change itself - core of uffd write [un]protect operation
      is moved into a separate function (uffd_wp_range()) and it is reused in
      the unregister code path.
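
      So the unregister path now ends up doing, in effect (a sketch; exact
      variable names assumed):

      /* in userfaultfd unregister, for a range that had wp mode enabled: */
      if (userfaultfd_wp(vma))
              uffd_wp_range(mm, vma, start, vma_end - start,
                            false /* enable_wp */);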
      
      Note that the new function will not check for anything, e.g.  ranges or
      memory types, because they should have been checked during the previous
      UFFDIO_REGISTER or it should have failed already.  It also doesn't check
      mmap_changing because we're with mmap write lock held anyway.
      
      I added a Fixes tag upon the commit introducing uffd-wp for
      shmem+hugetlbfs, because that's the only issue reported so far and
      that's the commit with which David's reproducer starts working (v5.19+).
      But the whole idea actually applies not only to file memories but also
      to anonymous ones.  It's just that we don't need to fix anonymous memory
      prior to v5.19 because there's no known way to exploit it.
      
      IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
      
      [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add DEVICE_ZONE to FOR_ALL_ZONES · a39c5d3c
      Committed by Hao Lee
      FOR_ALL_ZONES should be consistent with enum zone_type.  Otherwise,
      __count_zid_vm_events() has the potential to add the count to the wrong
      item when zid is ZONE_DEVICE.
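
      The fix appends an optional ZONE_DEVICE entry, mirroring how the other
      config-dependent zones are handled (a sketch following the pattern in
      <linux/vm_event_item.h>):

      #ifdef CONFIG_ZONE_DEVICE
      #define DEVICE_ZONE(xx) xx##_DEVICE,
      #else
      #define DEVICE_ZONE(xx)
      #endif

      #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
              HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)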
      
      Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io
      Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      Committed by David Hildenbrand
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
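
      Condensed, the new FOLL_FORCE decision reads like this (a simplified
      sketch of can_follow_write_pte(); the softdirty and uffd-wp details
      follow the commits cited below):

      static bool can_follow_write_pte(pte_t pte, struct page *page,
                                       struct vm_area_struct *vma,
                                       unsigned int flags)
      {
              if (pte_write(pte))
                      return true;            /* already writable */
              if (!(flags & FOLL_FORCE))
                      return false;
              /* FOLL_FORCE never overrides shared or never-writable VMAs */
              if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
                      return false;
              if (!(vma->vm_flags & VM_MAYWRITE))
                      return false;
              /* a writable VMA just needs a normal write fault instead */
              if (vma->vm_flags & VM_WRITE)
                      return false;
              /* COW is only truly broken for an exclusive anonymous page */
              if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                      return false;
              /* softdirty tracking and uffd-wp still demand a real fault */
              if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                      return false;
              return !userfaultfd_pte_wp(vma, pte);
      }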
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ata: libata: Set __ATA_BASE_SHT max_sectors · a357f7b4
      Committed by John Garry
      Commit 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors
      according to shost->max_sectors") inadvertently capped the max_sectors
      value for some SATA disks to a value which is lower than we would want.
      
      For a device which supports LBA48, we would previously have request queue
      max_sectors_kb and max_hw_sectors_kb values of 1280 and 32767 respectively.
      
      For AHCI controllers, the value chosen for shost max sectors comes from
      the minimum of the SCSI host default max sectors in
      SCSI_DEFAULT_MAX_SECTORS (1024) and the shost DMA device mapping limit.
      
      This means that we would now set the max_sectors_kb and max_hw_sectors_kb
      values for a disk which supports LBA48 at 512, ignoring DMA mapping limit.
      
      As reported by Oliver at [0], this caused a performance regression.
      
      Fix by picking a large enough max sectors value for ATA host controllers
      such that we don't needlessly reduce max_sectors_kb for LBA48 disks.
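
      Concretely, the base SCSI host template now carries a large default
      (a sketch; other template fields elided):

      /* include/linux/libata.h -- ATA_MAX_SECTORS_LBA48 is 65535 */
      #define __ATA_BASE_SHT(drv_name)                        \
              .module         = THIS_MODULE,                  \
              .name           = drv_name,                     \
              /* ... */                                       \
              .max_sectors    = ATA_MAX_SECTORS_LBA48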
      
      [0] https://lore.kernel.org/linux-ide/YvsGbidf3na5FpGb@xsang-OptiPlex-9020/T/#m22d9fc5ad15af66066dd9fecf3d50f1b1ef11da3
      
      Fixes: 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors")
      Reported-by: Oliver Sang <oliver.sang@intel.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
  19. 19 August 2022 (3 commits)
  20. 18 August 2022 (2 commits)
  21. 16 August 2022 (2 commits)