1. 29 Aug 2022, 2 commits
  2. 24 Aug 2022, 3 commits
  3. 23 Aug 2022, 1 commit
    • net/mlx5: Avoid false positive lockdep warning by adding lock_class_key · d59b73a6
      Moshe Shemesh authored
      Add a lock_class_key per mlx5 device to avoid a false positive
      "possible circular locking dependency" warning by lockdep, on flows
      which lock more than one mlx5 device, such as adding SF.
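
      The general pattern is roughly the following (a sketch only; the struct
      and function names are illustrative, not necessarily the exact mlx5
      ones):

        #include <linux/mutex.h>
        #include <linux/lockdep.h>

        struct mlx5_dev_ctx {
                struct mutex intf_state_mutex;
                struct lock_class_key lock_key;   /* one lockdep key per device */
        };

        static void dev_locks_init(struct mlx5_dev_ctx *dev)
        {
                mutex_init(&dev->intf_state_mutex);
                /* give this device's mutex its own lockdep class */
                lockdep_register_key(&dev->lock_key);
                lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);
        }

        static void dev_locks_cleanup(struct mlx5_dev_ctx *dev)
        {
                mutex_destroy(&dev->intf_state_mutex);
                lockdep_unregister_key(&dev->lock_key);
        }

      With a separate class per device instance, lockdep no longer treats the
      parent PF's intf_state_mutex and the SF child's intf_state_mutex as the
      same lock, so the nesting shown in the log below is not reported as a
      circular dependency.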
      
      kernel log:
       ======================================================
       WARNING: possible circular locking dependency detected
       5.19.0-rc8+ #2 Not tainted
       ------------------------------------------------------
       kworker/u20:0/8 is trying to acquire lock:
       ffff88812dfe0d98 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_init_one+0x2e/0x490 [mlx5_core]
      
       but task is already holding lock:
       ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&notifier->n_head)->rwsem){++++}-{3:3}:
              down_write+0x90/0x150
              blocking_notifier_chain_register+0x53/0xa0
              mlx5_sf_table_init+0x369/0x4a0 [mlx5_core]
              mlx5_init_one+0x261/0x490 [mlx5_core]
              probe_one+0x430/0x680 [mlx5_core]
              local_pci_probe+0xd6/0x170
              work_for_cpu_fn+0x4e/0xa0
              process_one_work+0x7c2/0x1340
              worker_thread+0x6f6/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
       -> #0 (&dev->intf_state_mutex){+.+.}-{3:3}:
              __lock_acquire+0x2fc7/0x6720
              lock_acquire+0x1c1/0x550
              __mutex_lock+0x12c/0x14b0
              mlx5_init_one+0x2e/0x490 [mlx5_core]
              mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
              auxiliary_bus_probe+0x9d/0xe0
              really_probe+0x1e0/0xaa0
              __driver_probe_device+0x219/0x480
              driver_probe_device+0x49/0x130
              __device_attach_driver+0x1b8/0x280
              bus_for_each_drv+0x123/0x1a0
              __device_attach+0x1a3/0x460
              bus_probe_device+0x1a2/0x260
              device_add+0x9b1/0x1b40
              __auxiliary_device_add+0x88/0xc0
              mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
              blocking_notifier_call_chain+0xd5/0x130
              mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
              process_one_work+0x7c2/0x1340
              worker_thread+0x59d/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&notifier->n_head)->rwsem);
                                      lock(&dev->intf_state_mutex);
                                      lock(&(&notifier->n_head)->rwsem);
         lock(&dev->intf_state_mutex);
      
        *** DEADLOCK ***
      
       4 locks held by kworker/u20:0/8:
        #0: ffff888150612938 ((wq_completion)mlx5_events){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340
        #1: ffff888100cafdb8 ((work_completion)(&work->work)#3){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340
        #2: ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
        #3: ffff88813682d0e8 (&dev->mutex){....}-{3:3}, at:__device_attach+0x76/0x460
      
       stack backtrace:
       CPU: 6 PID: 8 Comm: kworker/u20:0 Not tainted 5.19.0-rc8+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x57/0x7d
        check_noncircular+0x278/0x300
        ? print_circular_bug+0x460/0x460
        ? lock_chain_count+0x20/0x20
        ? register_lock_class+0x1880/0x1880
        __lock_acquire+0x2fc7/0x6720
        ? register_lock_class+0x1880/0x1880
        ? register_lock_class+0x1880/0x1880
        lock_acquire+0x1c1/0x550
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        __mutex_lock+0x12c/0x14b0
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? _raw_read_unlock+0x1f/0x30
        ? mutex_lock_io_nested+0x1320/0x1320
        ? __ioremap_caller.constprop.0+0x306/0x490
        ? mlx5_sf_dev_probe+0x269/0x370 [mlx5_core]
        ? iounmap+0x160/0x160
        mlx5_init_one+0x2e/0x490 [mlx5_core]
        mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
        ? mlx5_sf_dev_remove+0x130/0x130 [mlx5_core]
        auxiliary_bus_probe+0x9d/0xe0
        really_probe+0x1e0/0xaa0
        __driver_probe_device+0x219/0x480
        ? auxiliary_match_id+0xe9/0x140
        driver_probe_device+0x49/0x130
        __device_attach_driver+0x1b8/0x280
        ? driver_allows_async_probing+0x140/0x140
        bus_for_each_drv+0x123/0x1a0
        ? bus_for_each_dev+0x1a0/0x1a0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        ? trace_hardirqs_on+0x2d/0x100
        __device_attach+0x1a3/0x460
        ? device_driver_attach+0x1e0/0x1e0
        ? kobject_uevent_env+0x22d/0xf10
        bus_probe_device+0x1a2/0x260
        device_add+0x9b1/0x1b40
        ? dev_set_name+0xab/0xe0
        ? __fw_devlink_link_to_suppliers+0x260/0x260
        ? memset+0x20/0x40
        ? lockdep_init_map_type+0x21a/0x7d0
        __auxiliary_device_add+0x88/0xc0
        ? auxiliary_device_init+0x86/0xa0
        mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
        blocking_notifier_call_chain+0xd5/0x130
        mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
        ? mlx5_vhca_event_arm+0x100/0x100 [mlx5_core]
        ? lock_downgrade+0x6e0/0x6e0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        process_one_work+0x7c2/0x1340
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        ? pwq_dec_nr_in_flight+0x230/0x230
        ? rwlock_bug.part.0+0x90/0x90
        worker_thread+0x59d/0xec0
        ? process_one_work+0x1340/0x1340
        kthread+0x28f/0x330
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
      Fixes: 6a327321 ("net/mlx5: SF, Port function state change support")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d59b73a6
  4. 21 Aug 2022, 5 commits
    • mm/shmem: fix chattr fsflags support in tmpfs · cb241339
      Hugh Dickins authored
      ext[234] have always allowed unimplemented chattr flags to be set, but
      other filesystems have tended to be stricter.  Follow the stricter
      approach for tmpfs: I don't want to have to explain why csu attributes
      don't actually work, and we won't need to update the chattr(1) manpage;
      and it's never wrong to start off strict, relaxing later if persuaded. 
      Allow only a (append only), i (immutable), A (no atime) and d (no dump).
      
      Although lsattr showed 'A' inherited, the NOATIME behavior was not being
      inherited: because nothing sync'ed FS_NOATIME_FL to S_NOATIME.  Add
      shmem_set_inode_flags() to sync the flags, using inode_set_flags() to
      avoid that instant of lost immutability during fileattr_set().
      
      But that change switched generic/079 from passing to failing: because
      FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the
      INHERITED fsflags: remove them and generic/079 is back to passing.
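
      A sketch of the flag syncing described above (close to, but not
      necessarily identical to, the helper added by this patch):

        static void shmem_set_inode_flags(struct inode *inode, unsigned int fsflags)
        {
                unsigned int i_flags = 0;

                if (fsflags & FS_NOATIME_FL)
                        i_flags |= S_NOATIME;
                if (fsflags & FS_APPEND_FL)
                        i_flags |= S_APPEND;
                if (fsflags & FS_IMMUTABLE_FL)
                        i_flags |= S_IMMUTABLE;
                /*
                 * inode_set_flags() updates i_flags under the mask in a single
                 * store, so there is no instant at which S_IMMUTABLE is lost.
                 */
                inode_set_flags(inode, i_flags,
                                S_NOATIME | S_APPEND | S_IMMUTABLE);
        }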
      
      Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com
      Fixes: e408e695 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cb241339
    • mm/uffd: reset write protection when unregister with wp-mode · f369b07c
      Peter Xu authored
      The motivation of this patch comes from a recent report and patchfix from
      David Hildenbrand on hugetlb shared handling of wr-protected page [1].
      
      With the reproducer provided in commit message of [1], one can leverage
      the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
      not only the attacker process, but also the whole system.
      
      The lazy-reset mechanism of uffd-wp was used to make unregister faster,
      but it rests on the assumption that any leftover pgtable entries only
      affect the process itself: the user should be aware of anything it does,
      and nothing outside the process should be affected.
      
      But it seems that this is not true, and it can also be utilized to make
      some exploit easier.
      
      So far there's no clue showing that the lazy-reset is important to any
      userfaultfd users because normally the unregister will only happen once
      for a specific range of memory of the lifecycle of the process.
      
      Considering all of the above, what this patch proposes is to do explicit
      pte resets when unregistering an uffd region with wr-protect mode enabled.
      
      It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
      right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
      make the unregister slower.  From that point of view it is a very slight
      ABI change, but hopefully nothing should break with this change either.
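
      For reference, the userspace-visible equivalence described above looks
      roughly like this (an illustrative sketch, not code from the patch):

        #include <linux/userfaultfd.h>
        #include <sys/ioctl.h>

        /* what the kernel now effectively does when a wp-mode range is
         * unregistered: clear write protection first, then unregister */
        static int unprotect_and_unregister(int uffd, unsigned long addr,
                                            unsigned long len)
        {
                struct uffdio_writeprotect wp = {
                        .range = { .start = addr, .len = len },
                        .mode  = 0,             /* wp=false */
                };
                struct uffdio_range unreg = { .start = addr, .len = len };

                if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                        return -1;
                return ioctl(uffd, UFFDIO_UNREGISTER, &unreg);
        }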
      
      Regarding the change itself: the core of the uffd write-[un]protect
      operation is moved into a separate function (uffd_wp_range()) and reused
      in the unregister code path.
      
      Note that the new function will not check for anything, e.g.  ranges or
      memory types, because they should have been checked during the previous
      UFFDIO_REGISTER or it should have failed already.  It also doesn't check
      mmap_changing because the mmap write lock is held anyway.
      
      I added a Fixes tag pointing at the commit that introduced uffd-wp for
      shmem+hugetlbfs, because that's the only issue reported so far and that's
      the commit with which David's reproducer starts working (v5.19+).  But
      the whole idea actually applies not only to file-backed memory but also
      to anonymous memory.  It's just that we don't need to fix anonymous
      memory prior to v5.19 because there's no known way to exploit it there.
      
      IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
      
      [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f369b07c
    • mm: add DEVICE_ZONE to FOR_ALL_ZONES · a39c5d3c
      Hao Lee authored
      FOR_ALL_ZONES should be consistent with enum zone_type.  Otherwise,
      __count_zid_vm_events has the potential to add the count to the wrong
      item when zid is ZONE_DEVICE.
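
      The shape of the fix is roughly the following (a sketch; see
      include/linux/vm_event_item.h for the exact definitions):

        /* the optional zone contributes "xx##_DEVICE," or nothing at all */
        #ifdef CONFIG_ZONE_DEVICE
        #define DEVICE_ZONE(xx) xx##_DEVICE,
        #else
        #define DEVICE_ZONE(xx)
        #endif

        /* must list the same zones, in the same order, as enum zone_type */
        #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
                HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)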
      
      Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io
      Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a39c5d3c
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      David Hildenbrand authored
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
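
      The gist of the new decision is the following (a condensed sketch of the
      helper this patch adds, not its exact text):

        /* may FOLL_FORCE write through a read-only PTE in a COW mapping? */
        static bool can_follow_write_pte(pte_t pte, struct page *page,
                                         struct vm_area_struct *vma,
                                         unsigned int flags)
        {
                if (pte_write(pte))
                        return true;
                if (!(flags & FOLL_FORCE))
                        return false;
                /* FOLL_FORCE only applies to private, currently non-writable
                 * mappings that may become writable (i.e. COW mappings) */
                if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
                        return false;
                if (!(vma->vm_flags & VM_MAYWRITE) || (vma->vm_flags & VM_WRITE))
                        return false;
                /* COW was only properly broken if an exclusive anonymous page
                 * is mapped here; otherwise force a write fault to break COW */
                if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                        return false;
                /* take care of softdirty and uffd-wp manually, as in mprotect */
                if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                        return false;
                return !userfaultfd_pte_wp(vma, pte);
        }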
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5535be30
    • ata: libata: Set __ATA_BASE_SHT max_sectors · a357f7b4
      John Garry authored
      Commit 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors
      according to shost->max_sectors") inadvertently capped the max_sectors
      value for some SATA disks to a value which is lower than we would want.
      
      For a device which supports LBA48, we would previously have request queue
      max_sectors_kb and max_hw_sectors_kb values of 1280 and 32767 respectively.
      
      For AHCI controllers, the value chosen for shost max sectors comes from
      the minimum of the SCSI host default max sectors in
      SCSI_DEFAULT_MAX_SECTORS (1024) and the shost DMA device mapping limit.
      
      This means that we would now set the max_sectors_kb and max_hw_sectors_kb
      values for a disk which supports LBA48 to 512, ignoring the DMA mapping
      limit.
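
      For reference, with 512-byte sectors (kb = sectors / 2) the numbers
      quoted above correspond to:

         1024 sectors  ->   512 KB  (SCSI_DEFAULT_MAX_SECTORS cap, the regression)
         2560 sectors  ->  1280 KB  (previous max_sectors_kb for LBA48 disks)
        65535 sectors  -> 32767 KB  (previous max_hw_sectors_kb for LBA48 disks)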
      
      As reported by Oliver at [0], this caused a performance regression.
      
      Fix by picking a large enough max sectors value for ATA host controllers
      such that we don't needlessly reduce max_sectors_kb for LBA48 disks.
      
      [0] https://lore.kernel.org/linux-ide/YvsGbidf3na5FpGb@xsang-OptiPlex-9020/T/#m22d9fc5ad15af66066dd9fecf3d50f1b1ef11da3
      
      Fixes: 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors")
      Reported-by: Oliver Sang <oliver.sang@intel.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      a357f7b4
  5. 19 Aug 2022, 2 commits
    • KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd
      Chao Peng authored
      The motivation for this renaming is to make these variables and related
      helper functions less mmu_notifier-bound, so that they can also be used
      for page invalidation that is not based on mmu_notifier.
      mmu_invalidate_* was chosen to better describe the purpose these
      variables serve: 'invalidating' a page.
      
        - mmu_notifier_seq/range_start/range_end are renamed to
          mmu_invalidate_seq/range_start/range_end.
      
        - mmu_notifier_retry{_hva} helper functions are renamed to
          mmu_invalidate_retry{_hva}.
      
        - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
          avoid confusion with mn_active_invalidate_count.
      
        - While here, also update kvm_inc/dec_notifier_count() to
          kvm_mmu_invalidate_begin/end() to match the change for
          mmu_notifier_count.
      
      No functional change intended.
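
      For context, a typical consumer of these fields in a page-fault path
      looks roughly like this after the rename (a sketch of the call pattern
      only, not a complete excerpt):

        mmu_seq = kvm->mmu_invalidate_seq;
        smp_rmb();
        /* ... resolve gfn to pfn outside mmu_lock ... */
        write_lock(&kvm->mmu_lock);
        if (mmu_invalidate_retry_hva(kvm, mmu_seq, hva))
                goto retry;     /* an invalidation raced with us; start over */
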
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20ec3ebd
    • KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS · bdd1c37a
      Chao Peng authored
      KVM_INTERNAL_MEM_SLOTS better reflects the fact that those slots are used
      internally by KVM (they are invisible to userspace) and avoids confusion
      with future private slots, which can have a different meaning.
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-2-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bdd1c37a
  6. 18 Aug 2022, 1 commit
  7. 16 Aug 2022, 7 commits
  8. 15 Aug 2022, 1 commit
    • radix-tree: replace gfp.h inclusion with gfp_types.h · 9f162193
      Yury Norov authored
      Radix tree header includes gfp.h for __GFP_BITS_SHIFT only. Now we
      have gfp_types.h for this.
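
      The change itself is essentially just the include swap (sketch):

        /* include/linux/radix-tree.h: only the gfp type definitions are needed */
        #include <linux/gfp_types.h>    /* was: #include <linux/gfp.h> */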
      
      Fixes powerpc allmodconfig build:
      
         In file included from include/linux/nodemask.h:97,
                          from include/linux/mmzone.h:17,
                          from include/linux/gfp.h:7,
                          from include/linux/radix-tree.h:12,
                          from include/linux/idr.h:15,
                          from include/linux/kernfs.h:12,
                          from include/linux/sysfs.h:16,
                          from include/linux/kobject.h:20,
                          from include/linux/pci.h:35,
                          from arch/powerpc/kernel/prom_init.c:24:
         include/linux/random.h: In function 'add_latent_entropy':
      >> include/linux/random.h:25:46: error: 'latent_entropy' undeclared (first use in this function); did you mean 'add_latent_entropy'?
            25 |         add_device_randomness((const void *)&latent_entropy, sizeof(latent_entropy));
               |                                              ^~~~~~~~~~~~~~
               |                                              add_latent_entropy
         include/linux/random.h:25:46: note: each undeclared identifier is reported only once for each function it appears in
      Reported-by: kernel test robot <lkp@intel.com>
      CC: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f162193
  9. 14 Aug 2022, 2 commits
    • NFS: Fix another fsync() issue after a server reboot · 67f4b5dc
      Trond Myklebust authored
      Currently, when the writeback code detects a server reboot, it redirties
      any pages that were not committed to disk, and it sets the flag
      NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
      that dirtied the file. While this allows the file descriptor in question
      to redrive its own writes, it violates the fsync() requirement that we
      should be synchronising all writes to disk.

      While the problem is infrequent, we do see corner cases where an
      untimely server reboot causes the fsync() call to abandon its attempt to
      sync data to disk, causing data corruption issues due to missed error
      conditions or similar.
      
      In order to tighten up the client's ability to deal with this situation
      without introducing livelocks, add a counter that records the number of
      times pages are redirtied due to a server reboot-like condition, and use
      that in fsync() to redrive the sync to disk.
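
      Conceptually the fsync() path becomes a loop of the following shape (a
      sketch; the counter's field name here is illustrative):

        do {
                /* snapshot the "redirtied due to server reboot" counter */
                nr_rewrites = atomic_long_read(&nfsi->redirtied_pages);
                ret = file_write_and_wait_range(file, start, end);
                if (!ret)
                        ret = nfs_commit_inode(inode, FLUSH_SYNC);
                /* if a reboot redirtied pages meanwhile, drive the sync again */
        } while (!ret &&
                 nr_rewrites != atomic_long_read(&nfsi->redirtied_pages));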
      
      Fixes: 2197e9b0 ("NFS: Fix up fsync() when the server rebooted")
      Cc: stable@vger.kernel.org
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      67f4b5dc
  10. 13 Aug 2022, 2 commits
  11. 11 Aug 2022, 14 commits